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Abstract 

Background: Insertions and deletions (indels) account for more nucleotide differences between two related DNA 
sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables 
us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic 
models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel 
component of the probability of a given sequence alignment as a product of either probabilities of column-to-column 
transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are 
related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence 
along the time-axis. Moreover, currently none of these models can fully accommodate biologically realistic features, 
such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions. 

Results: Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment 
under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the 
evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general 
"substitution/insertion/deletion (SID) model". Using the operator representation of indels and the technique of 
time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent 
indel histories. Exploiting the equivalence relations between different indel histories, we find a "sufficient and nearly 
necessary" set of conditions under which the probability can be factorized into the product of an overall factor and the 
contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. 
The conditions distinguish evolutionary models with factorable alignment probabilities from those without ones. The 
former category includes the "long indel" model (a space-homogeneous SID model) and the model used by Dawg, a 
genuine sequence evolution simulator. 

Conclusions: With intuitive clarity and mathematical preciseness, our theoretical formulation will help further advance 
the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels. 

Keywords: Stochastic evolutionary model, Insertion/deletion (indel), Sequence alignment probability, Factorability, 
Biological realism, Power-law distribution, Rate variation, Non-equilibrium evolution 
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Background 

The evolution of DNA, RNA and protein sequences is 
driven by mutations such as base substitutions, insertions 
and deletions (indels), recombination and other genomic 
rearrangements ( e.g., [1-3]). Some recent comparative 
genomic analyses revealed that indels account for more 
base differences between closely related genomes than 
base substitutions do (e.g., [4-7]). It is therefore imperative 
to develop a method to reliably calculate the probability of 
sequence evolution via mutations including indels. Since 
the groundbreaking works by Bishop and Thompson [8] 
and by Thorne, Kishino and Felsenstein [9], many studies 
have been made to calculate the probabilities of pairwise 
alignments (PWAs) and multiple sequence alignments 
(MSAs) under probabilistic models aiming to incorporate 
the effects of indels. And the methods have greatly im¬ 
proved in terms of the computational efficiency and the 
scope of application (reviewed, e.g., in [10-12]). Most of 
these studies are based on hidden Markov models 
(HMMs) (e.g., [13]) or transducer theories (e.g., [14]). Even 
today, the studies on these methods are steadily advancing 
(e.g., [15, 16]), and it seems that their mathematical and al¬ 
gorithmic bases are about to be established. 

However, it is important to remember that these desir¬ 
able properties alone are not sufficient for a model in 
natural science to be satisfactory. In addition to having 
the mathematical (or algorithmic) soundness, a satisfac¬ 
tory model must also approximate well, or at least de¬ 
cently, the real phenomena it is intended to describe. In 
the case of an indel probabilistic model, there are two 
key elements for this requisite: one is the evolutionary 
consistency of the model, and the other is the model’s 
flexibility to accommodate various biologically realistic 
features, i.e., the biological realism, of indels. 

Let us first explain the evolutionary consistency. In 
natural sequence evolution, the probability (density) of 
an indel process must be given vertically, as a multiplica¬ 
tive accumulation of the probabilities of transitions be¬ 
tween states of an entire sequence, each from one time 
point to the next one, along the time axis (or along a 
phylogenetic tree when dealing with MSAs) (Fig. la). If a 
probabilistic model gives the probability density of each 
evolutionary process according to this natural design, we 
call it a “genuine stochastic evolutionary model”, or sim¬ 
ply an “evolutionary model” for short. And we will con¬ 
sider an indel probabilistic model as “evolutionarily 
consistent”, if its alignment probabilities can be derived 
directly from an evolutionary model, even if it does not 
appear to follow the aforementioned natural design. By 
definition, each of HMMs and transducers calculates 
(the indel component of) an alignment probability hori¬ 
zontally, as a product of either inter-column transition 
probabilities or block-wise contributions (Fig. lc). 
Therefore, it is a priori unclear whether each HMM or 


transducer is evolutionarily consistent or not, or, if it is, 
how (Fig. 1). It would be worth mentioning that some 
models were indeed derived explicitly from some sorts 
of evolutionary models (e.g., [9, 17, 18]). Unfortunately, 
all such studies in the past imposed some unnatural as¬ 
sumptions, such as the prohibition of overlapping indels 
and the restriction of deletions to single-base ones. Such 
assumptions were necessary for an alignment probability 
to be trivially factorable, at least as a product of block- 
wise contributions. Thus, they are unsatisfactory from 
the viewpoint of the second key element, i.e., the bio¬ 
logical realism. 

Regarding the biological realism, past empirical studies 
revealed some properties of real indels, in addition to 
their possibilities to affect multiple contiguous residues 
at a time and to overlap others. Among the most im¬ 
portant would be the studies that showed power-law dis¬ 
tributions of indel lengths (see, e.g., [19] and references 
therein). On the contrary, standard HMMs and trans¬ 
ducers can usually implement geometric distributions of 
indel lengths, or at best mixed geometric distributions 
(e.g., [20]), but cannot implement the power-law distri¬ 
butions themselves. But some generalized HMMs (or 
transducers) (e.g., [21, 22]) can incorporate power-law 
indel length distributions. For example, the HMM of 
Kim and Sinha [22] is quite flexible, and it can incorpor¬ 
ate the power-law distributions and also do away with 
the commonly imposed time-reversibility. As discussed 
e.g., in [21] and [23], there is no biological reason for im¬ 
posing the time reversibility, and they were usually im¬ 
posed to reduce the computational time. In this sense, 
the HMM of Kim and Sinha is two steps closer to the 
biological reality than the standard HMMs (and trans¬ 
ducers). Unfortunately, similarly to the standard HMMs 
and transducers, their HMM is not evolutionarily con¬ 
sistent and thus cannot correctly handle overlapping 
indels along the same branch, though they can handle 
overlapping indels along different branches. Another 
possibly important biologically realistic feature is the 
indel rate variation across sites (or regions) (e.g., [24]), 
due to selection and the mutational predispositions 
(caused, e.g, by the sequence or epigenetic contexts). 
Thus far, attempts to incorporate this feature have been 
rare (e.g., [25]), and most studies have handled space- 
homogeneous models, whose indel rates are homogeneous 
along the sequence. 

As far as we know, except the models implemented in 
some genuine sequence evolution simulators (e.g., [26-28]), 
there is only one class of genuine stochastic evolutionary 
models discussed thus far that is also considerably biologic¬ 
ally realistic, which are the “substitution/insertion/deletion 
(SID) models” proposed by Miklos et al. [21]. The SID 
models in general do not impose the aforementioned un¬ 
natural restrictions on indels. Moreover, the general SID 
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Fig. 1 Genuine stochastic evolutionary model vs. HMM (or transducer), a Probability density calculation via a genuine stochastic evolutionary 
model. Each sequence state is represented as an array of sites (boxes). Sites to be deleted are shaded in red or magenta. Inserted sites are 
shaded in blue or cyan. The s, and s F , respectively, denote the initial and final states. The s v (v = 1,2,3) is an intermediate state. (The 
denotes a probability, the "p[...]" denotes a probability density.) b A pairwise alignment (PWA) between the initial (/) and final (F) states resulted 
from the indel process in panel a. The C, (/ = 1,.... 10) labels the alignment column below, c Probability calculation via a HMM (or a transducer). 
It is a priori unclear whether or how the methods in panels a and c are related with each other. For clarity, residue states and substitutions were 
omitted. (Note that the equation in panel a is merely a rough expression to give a broad idea on the issue. Rigorous expressions will be given in 
Results and discussion.) Panels a and b of this figure were adapted from panels B and F of Fig. 1 of [32] 


model can accommodate any indel length distributions, 
and also some indel rate variations across sites (albeit 
through the residue state context alone). Unfortunately, 
however, we have not seen any further theoretical develop¬ 
ment of the general SID model since it was proposed. In¬ 
stead, Miklos et al. developed the “long indel” model [21], 
which is a space-homogeneous, time-reversible SID model. 
(More precisely, the insertion rate depends on the inserted 
sequence only through the product of the frequencies of its 
constituent residues. As mentioned above, the time- 
reversibility was introduced just for computational con¬ 
venience, and it could be dispensed with if desired.) And 
they gave a verbal justification that the probability of a 
PWA under the “long indel” model can be calculated via a 
generalized HMM, as a product of contributions from 
“chop-zones” delimited by gapless columns. In the present 
viewpoint, this is the most satisfactory HMM that we know 
was used for actual sequence analyses, because it satisfies 
both the evolutionary consistency and the biological real¬ 
ism to some degree. Nevertheless, their justification, al¬ 
though plausible, has two problems. First, it is unclear 
exactly how their HMM is related to the ab initio probabil¬ 
ities of evolutionary (especially indel) processes under their 
evolutionary model. And second, their justification takes 
advantage of the space-homogeneity of the model, which 


makes it unclear how their HMM can be extended to be 
space-heterogeneous while keeping the evolutionary 
consistency. To solve these two problems, we need at least 
to get back to their origin, i.e., the general SID model. It 
seems, however, that this model has never been theoretic¬ 
ally dissected thus far, possibly due to the lack of mathem¬ 
atical or conceptual tools to handle it easily. 

In this study, we examine a general continuous-time 
Markov model of the evolution of an entire sequence via 
insertions, deletions and substitutions. This model could 
be regarded as an extension of the general SID model, in 
the sense that it allows explicit (or inherent) rate vari¬ 
ation across sites, not only due to residue state contexts. 
Such rate variation could be regarded as the effect of, 
e.g., the epigenetic context and/or the context within the 
3D structure of the protein product {e.g., [25]). 1 To the¬ 
oretically dissect the ab initio calculation of alignment 
probabilities under this model, we introduce some useful 
tools. Among the most important would be the operator 
representation of mutations, namely, insertions, deletions 
and substitutions. This enabled us to shift our focus from 
the trajectory of sequence states, which played a central 
role in [21], to the history of mutations (especially indels). 
Moreover, the operator representation enabled to algebra¬ 
ically define the equivalence relationships between two 
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different series of indels. They, in cooperation with 
the focus shift, enabled to define the local history set 
(LHS) equivalence classes. These equivalence classes 
play an essential role when deriving the “sufficient 
and nearly necessary” set of conditions under which 
alignment probabilities are indeed factorable, thus 
providing a sort of generalized HMM. We also adapt 
techniques from the time-dependent perturbation ex¬ 
pansion in quantum mechanics [29, 30], expanding 
each alignment probability into a summation over 
contributing mutational histories with different num¬ 
bers of indels. It should be noted, however, that we 
formally deal with all terms in the expansion.” Thus, 
at least formally, the probabilities we deal with are 
exact solutions of the model’s defining equation. For 
clarity, we will focus on insertions/deletions in the 
bulk of the manuscript. However, we can also incorp¬ 
orate substitutions; see, e.g., [31] for more details. 

This paper describes the backbone of our study (more 
extensively recorded in an unpublished paper [32]) to 
give the theoretical basis of our ab initio probability cal¬ 
culation under the general continuous-time Markov 
model of indels. Peripheral topics surrounding the study 
can be found in [32].° Throughout the paper, we sup¬ 
pose that each probability is calculated under a given 
evolutionary model setting, including the phylogenetic 
tree of the sequences. In section R1 of Results and dis¬ 
cussion, we briefly review the most general form of the 
SID model [21]. Then, in section R2, we introduce two 
important tools, namely, the ancestry index and the op¬ 
erator representation of mutations including indels. 
Using the results of sections R1 and R2, we define our 
general continuous-time Markov model in section R3, 
and formally give the general solution to its defining 
equation in terms of the operator representation. In sec¬ 
tion R4, we formally express the ab initio probability of 
a given PWA in a perturbation expansion. Then, using 
the concept of the LHS equivalence classes defined in 
section R5, we derive in section R6 the conditions under 
which the PWA probability is factorable. In section R7, 
the derivation is extended to the probability of a given 
MSA. In section S8, some examples are given to illus¬ 
trate models with factorable and non-factorable align¬ 
ment probabilities. The former category includes the 
indel evolutionary model of Dawg [26] and the “long 
indel” model [21], among others. In section R9, we 
discuss the merits, possible uses and extensions, as 
well as some outstanding issues, of the results in this 
study. In Table 1, we summarize the key concepts 
and results of this paper, mainly for those who want 
its gist quickly. Likewise, Table SI (in Additional file 1) 
summarizes mathematical symbols used commonly in this 
paper, to facilitate the readers’ cruise through the equa¬ 
tions. Supplementary methods (in Additional file 1) and 


Supplementary appendix (in Additional file 2) give de¬ 
tailed derivations of some important results. The former is 
more essential and accessible to a wider audience; the lat¬ 
ter is for those who are interested in further mathematical 
details. 

We end this section with two notes. First, in this 
paper, the term “an evolutionary (or indel) process” 
means a series of successive mutation (or indel) events 
with both the order and the specific timing specified, 
and the term “an evolutionary (or indel) history” means 
a series of successive events with only the order speci¬ 
fied. This usage should conform to the common practice 
in this field. Second, we will describe the results in 
the bra-ket notation, similar to that in quantum me¬ 
chanics [29, 33]. However, those who are unfamiliar 
with the notation need not worry about it. Our for¬ 
mulation via the bra-ket notation can be proven to be 
equivalent to the standard formulation of the 
continuous-time Markov model via the vector-matrix 
notation. (We refer the interested readers to Supple¬ 
mentary appendix SA-1 in Additional file 2.) There¬ 
fore, if desired, the symbols of a bra ((x|), a ket (|y)), 
and an operator (O) could be regarded simply as con¬ 
venient reminders of a row vector, a column vector, 
and a matrix, respectively. 

Results and discussion 

The key concepts and results proposed/obtained in this 
paper are summarized in Table 1. Readers can use the 
table to quickly grasp an overview of this paper, as well 
as to easily locate what they look for. Also, most math¬ 
ematical symbols are briefly explained in Table SI in 
Additional file 1. 

R1. Brief review of general SID model 

Miklos et al. [21] proposed a class of evolutionary 
models, which they called the “substitution/insertion/ 
deletion (SID) models”. They are continuous-time 
Markov models defined on the space of strings {i.e., 
sequences) of any lengths, each of which consists of 
letters (i.e., residues, such as bases or amino acids) from a 
given alphabet (denoted as fl here). Following [21], their 
state space will be denoted as: D*= U^ =0 f2 L , whose com¬ 
ponent, D l , is the space of all sequences of length L. If 
desired, a sequence state, s £ O l , could be represented as: 
s = [cji, o> 2 , ..., tot] (with (o x sD for x = l,2,...,L) (see 
Fig. 2a). In this model, mutations are defined as transitions 
from a sequence state to another, and their instantaneous 
rates can be given via the following “rate grammar” they 
proposed: 

Substitution: s = s £ cds s g t_ Sr(0 ' Sr . (Rl.l) 

Insertion: s = SlSr , = (RL2) 

Deletion: 5 = = Sl s„. (Rl-3) 
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Table 1 Key concepts and results in this paper 

Concept/result Description 

Ancestry index An ancestry index is assigned to each site. Sharing of an 

ancestry index among sites indicates the sites' mutual 
homology. As a fringe benefit, the indices enable the 
mutation rates to vary across regions (or sites) beyond 
the mere dependence on the residue state of the 
sequence. 

Operator representation of mutations This enables the intuitively clear and yet mathematically 

precise description of mutations, especially insertions/ 
deletions, on sequence states. This is a core tool in 
our ab initio theoretical formulation of the genuine 
stochastic evolutionary model. 

An operator version of the rate matrix, which specifies 
the rates of the instantaneous transitions between the 
states in our evolutionary model. 

In other words, the rate operator describes the 
instantaneous stochastic effects of single mutations 
on a given sequence state. 

An operator version of the finite-time transition matrix, 
each element of which gives the probability of 
transition from a state to another after a finite time-lapse. 
This results from the cumulative effects of the rate 
operator during a finite time-interval. 

Ist-order time differential equations (forward and 
backward) that define our indel evolutionary model. 
They are operator versions of the standard defining 
equations of a continuous-time Markov model. 

Two integral equations (forward and backward) that are 
equivalent to the aforementioned differential equations 
defining our indel evolutionary model. They play an 
essential role when deriving the perturbation expansion 
of the finite-time transition operator. 

Perturbation expansion (transition operator) The perturbation expansion of the finite-time transition 

operator. It was derived in an intuitively clear yet 
mathematically precise manner, by using the 
aforementioned defining integral equations. 

Perturbation expansion (ab initio PWA probability) The perturbation expansion of the ab initio probability 

of a given PWA, conditioned on the ancestral sequence 
state, under a given model setting. 

An equivalence relation between the products of two 
indel operators each. The relations play key roles when 
defining LHS equivalence classes. 

An equivalence class consisting of global indel histories 
that share all local history components. The classes play 
an essential role when proving the factorability of a 
given PWA probability. 

We proved that, under conditions (i) and (ii) (below 
Eq. (R6.4)), the ab initio probability of a given PWA is 
factorable into the product of an overall factor and 
contributions from local PWAs. 

Perturbation expansion (ab initio MSA probability) The "perturbation expansion" of the ab initio probability 

of a given MSA, under a given model setting including 
a given phylogenetic tree. 

Factorability (ab initio MSA probability) We proved that, under conditions (i), (ii) (below 

Eq. (R6.4) and (iii) (Eq. (R7.8)), the ab initio probability 
of a given MSA is factorable into the product of an 
overall factor and contributions from local MSAs. 

Totally space-homogeneous model Such a model gives factorable PWA probabilities, 

because the exit rate is an affine function of the 
sequence length (regardless of whether indel rates are 
time-dependent or not). The indel model of Dawg [26] 
and the "long indel" model [21] belong to this class. 


Binary equivalence relation 


Local-history-set (LHS) equivalence class 


Factorability (ab initio PWA probability) 


Rate operator 


Finite-time transition operator 


Defining equations (differential) 


Defining equations (integral) 


Main location 

Section R2 (1st and 2nd paragraphs), 
Fig. 2 


Section R2 (3rd paragraph), 

Fig. 3 


Section R3, 

Eqs. (R3.1-R3.9) (full mutational model), 

Eqs. (R3.2,R3.6,R3.11 R3.15) (indel model) 


Section R3, 

Eq. (R3.17), Eq. (R3.18) 


Section R3, 

Eqs. (R3.19.R3.21) (forward), 
Eqs. (R3.20.R3.21) (backward) 


Section R4, 

Eq. (R4.4) (forward), 
Eq. (R4.5) (backward) 


Section R4, Eqs. (R4.6,R4.7) 


Section R4, Eq. (R4.8) or Eq. (R4.9) 


Section R5, Eqs. (R5.2a-R5.2d) 


Section R5, 

below Eq. (R5.4), 

(e.g., Fig. 5) 


Section R6, Eqs. (R6.7,R6.8), 

(see also Eqs. (R6.2,R6.3,R6.4)) 


Section R7, Eqs. (R7.2,R7.3,R7.4) 


Section R7, 

Eq. (R7.9) 


Subsection R8-1, Eqs. (R8-1.1 ,R8-1.2), 

Eqs. (R8-1,3,R8-1.4) 
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Table 1 Key concepts and results in this paper (Continued) 


Equivalence (with caveat) of the "chop-zone" We showed that the "chop-zone” method in [21], Subsection R8-1, Supplementary 

method and our ab initio method adapted to calculate the probability of a given LHS appendix SA-3 

equivalence class, is equivalent to our ab initio method, 
at least if the indel model is spatiotemporally 
homogeneous. 


Model with simple insertion rate variation 


Space-homogenous model flanked by 
essential sites 


If the deletion rates are space-homogeneous and the Subsection R8-1, Eq. (R8-1.5) 
insertion rates depend only on the insertions' flanking 
sites, the PWA probabilities are still factorable. 

This kind of model is a simplest example of the indel Subsection R8-2, 

model whose ab initio PWA probabilities are Eqs. (R8-2.1,R8-2.3) 

non-factorable 


Degree of non-factorability 


Space-heterogeneous model with factorable 
PWA probability 


The "difference of exit-rate differences” (Eq. (R8-2.4)) Subsection R8-2, Eq. (R8-2.4) 
could measure the "degree of non-factorability." 

We found that a class of indel models with Subsection R8-3, Eqs. (R8-3.1 ,R8-3.2), 

rate-heterogeneity across regions (Eqs. (R8-3.1 ,R8-3.2)) Eqs. (R8-3.3,R8-3.4,R8-3.5), Figure S3 
have partially factorable PWA probabilities. 


NOTE: Especially important things are in boldface 


Here, with m- S,IandD denote the rates of 

the substitution, the insertion, and the deletion, re¬ 
spectively, possibly depending on the arguments in 
the parentheses. In each of the above rules, s and s', 
respectively, denote the sequence states before and 
after the mutation. The symbols s L and s R denote the 
subsequences flanking the mutated portion from the 
left and from the right, respectively. 5 These SID 
models equipped with this “rate grammar” are genu¬ 
ine stochastic evolutionary models, and thus do not 
usually impose unnatural restrictions on the muta¬ 
tions (except possibly through restrictions on muta¬ 
tion rates). And the most general SID model can 
accommodate quite general mutation rates, including 
indel length distributions, by allowing their depend¬ 
ence on the sequence states before and after the mu¬ 
tation. 6 As far as we know, however, this most 
general SID model was not theoretically examined 
further (at least thus far), maybe because adequate 
mathematical or conceptual tools were not devised 
and because of some other reasons (mentioned 
below). In the following sections, we will provide such 
tools, which in turn will help theoretically dissect an 
extended version of the most general SID model. 


R2. Ancestry indices and operator representation of 
mutations 

First, we slightly extend the framework of the SID 
models, by assigning an ancestry to each site, which 
is a unit position in the sequence that accommo¬ 
dates a single residue. Hereafter, we will consider 
that it is the sites, instead of the residues, that are 
inserted/deleted. For example, the above example se¬ 
quence state, s = [o>i, o) 2 , •••, (UtKefT 6 ), can be extended as: 

s= [(ui,wi),(u 2 ,«2),-.,(yi,wz.)]( e (^ xflf) (Fig- 2b). 

Here, v x (e 7) is the ancestry index assigned to the x-th site 
of the sequence (with x = 1,2, Alternatively, the 

extended sequence state could also be represented 

as:s= , 7 where u= [iq,i >2 is an array 

of ancestry indices assigned to the sites, and cj= 
[wi, cii] (eO 1 ) is an array of residue states that 
fill in the sites. (Note that co corresponds to the se¬ 
quence state (s) in the SID models (in section Rl)). 
The ancestry indices follow a number of rules: (i) dif¬ 
ferent sites in the same sequence always have differ¬ 
ent ancestry indices; (ii) the ancestry index of a site 
remain unchanged as long as the site exists; and (iii) 


a State in SID models 


b Extended state 


A 

c Basic state 


s (= ffl) = 


A 

C 

G 

A 

T 

G 

C 
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2 

3 
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6 

7 

A 

C 

G 

A 

T 

G 

C 


si=v) = 


1 

2 

3 

4 

5 

6 

7 


Fig. 2 Sequence states, a A sequence state in the SID models [21], Each site (cell) is assigned a residue state (A, T, G or C). b The corresponding 
extended sequence state in our evolutionary model. Each site is assigned an ancestry index (number) in addition to a residue state, c The 
corresponding basic sequence state. Each site is assigned an ancestry index alone. The w represents the set of residue states assigned to all sites. 
The ~u represents the set of ancestry indices assigned to all sites. (Note that the identical symbols (s's) used in panels a and c represent different 
types of states (of the same sequence)) 
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every time when an insertion takes place, new ances¬ 
try indices are assigned to the newly inserted sites. 
Other than these rules, the assignment of the indices 
is arbitrary. Especially, their values themselves are not 
so important. The most essential thing is whether 
two sites of different sequences share the same ances¬ 
try index or not; if so, the sites are mutually homolo¬ 
gous (actually orthologous unless duplications are 
considered). Another important thing is the spatial re¬ 
lationship of the site having each ancestry with other 
sites, especially preserved ancestral sites (PASs, ex¬ 
plained shortly). For the space of ancestry indices, Y, 
we will tentatively use the set of all positive integers 
(A?i={l, 2,3,...}), although there should be a more 
appropriate mathematical entity. 8 Because of the 
rules imposed above, the space of the extended se¬ 
quence states (denoted as S H in [31]) is included in 
but never equal to { Y x O} = U“ = 0 { Yx C1} L . 

We devised the ancestry indices to facilitate the de¬ 
scription of indel histories by keeping track of the 
evolutionary course of each site. For example, consider 
that a sequence whose initial state had the ancestry indices 
V = [1,2, 3,4,5,6,7] evolved into the final state with 

~v = [1, 5, 6, 8, 9, A, 7]. (Here the “A” is an abbreviation 

of 10.) Then, we can immediately infer what happened 
during its evolution by aligning the two sequences (repre¬ 
sented by the arrays of ancestry indices): 


v 123456-7. 

/ 

v 1-5689A7 

F 


(R2.1) 


This alignment tells that the sites with ancestries 2, 
3 and 4 were deleted and that the sites with ances¬ 
tries 8, 9 and A were inserted. We can also see that 
the sites with 1, 5, 6 and 7 were preserved during the 
evolution. We will henceforth refer to such sites as 
“preserved ancestral sites (PASs)”. The PASs indicate 
that no indels occurred at or through the sites during 
the evolution under consideration. Thus, they can be 
used to narrow down the possible indel histories that 
might have resulted in the pairwise alignment (PWA) 
(as argued, e.g., in [21]). A fringe benefit of the ances¬ 
try indices is that they enable the mutation rates to 
vary beyond the dependence on the residue states 
(section R3). Hereafter, we refer to an array of ances¬ 
try indices (like d) as a “basic sequence state” (abbre¬ 
viated as a “sequence state” or a “basic state”), which 
is a backbone to be fleshed out by the residue states 
(like 6j) to give the extended sequence state (like s ). 
Hereafter, the basic sequence state will often be de¬ 
noted, e.g., as s (Fig. 2c). (Henceforth, symbols like s will 


never denote a residue state (like oj ), which was a se¬ 
quence state in section Rl). And S n (cY* = U ” = 0 Y L ) de¬ 
notes the space of the basic states. 

Another, probably more important, tool we introduce 
here is the operator representation of mutations. When 
considering the evolutionary processes (or histories), we 
symbolically represent each (extended) sequence state as 
a bra-vector, like (s/|, which could be regarded as an ab¬ 
stract extension of a row vector in the normal represen¬ 
tation of the continuous-time Markov model. Then, 
each mutation of a sequence can be represented as a lin¬ 
ear operator (regarded as an abstract extension of a 
matrix) acting on a bra-vector (Fig. 3). Operator M$ 
(x, oiM-ca') denotes the substitution of the residue to 
oj'(xoj) if it was co at the x th site (or the null action 
otherwise) (Fig. 3a). Operator Mj(x,l) denotes the inser¬ 
tion of / sites between the x-th and (x + l)-th sites 
(Fig. 3b). The insertion operator alone does not deter¬ 
mine the residue states of the inserted sites. It is the job 

of the “fill-in” operator, F (x, Scj / [/] ^ , which fills in the l 

inserted sites with a new array of residues, Sco r \l] (etV) ■ 
(This “division of labor” facilitates the decoupling of the 
substitution component and the indel component of an 
alignment probability. See Appendix A1 of [31] for more 
details). Finally, operator Md(xb,Xe) denotes the dele¬ 
tion of the subsequence between (and including) the 
x B -th. and x £ -th sites in the sequence immediately be¬ 
fore the event (Fig. 3c). The action of multiple succes¬ 
sive mutation events can be expressed as a product of 
mutation operators on an initial (extended) sequence 
state. For example, the indel history illustrated in 
panels a and b of Fig. 4 can be represented as a series 
of indel events, [M d ( 3, 3), Af/(5, 2 ),M d ( 2, 3), Af/(5,1)], 
on the initial basic state S/ (given above). Then, the 
final result of this indel history is expressed as: 
(s I \M D (3,3)M I (5,2)M D (2,3)M I (5,l) . Figure 4c shows 
the MSA among the initial, intermediate and final se¬ 
quence states. Figure 4d shows the resulting PWA 
between the initial and final sequence states. 

These new tools, the ancestry indices and the operator 
representation of mutations, will play essential roles in 
our theoretical development described below. In the SID 
models [21], each evolutionary process was expressed as 
a (time-recorded) trajectory of sequence states, each of 
which was represented by an array of residues (without 
ancestry assignments). In consequence, an instantaneous 
transition from a state to the next state was often 
expressed as a summation of multiple possible muta¬ 
tions. (For example, the transition from <u= [A, A] to cj / 
= [A] could result from either Md( 1,1) or Mo(2,2)). 
Ancestry indices help avoid such ambiguous channels by 
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Fig. 3 Operator representation of mutations, a A substitution operator, M s (5, Tw-C). The residues before and after the substitution are in boldface in blue 
and red, respectively, b An insertion operator, M/(6, 3), and a fill-in operator, F(6, [7“, A, C]). The inserted sites are shaded in cyan. (Note that "A" at the top 
of the rightmost inserted column means the ancestry index of 10, not the residue state of A), c A deletion operator, M 0 ( 2, 4). The sites to be deleted are 
shaded in magenta. In this figure, the extended sequence states were used for illustration. The bra-vector below each array denotes the state. The extended 
state, s", is identical to that in Fig. 2b. Each vertical arrow indicates the action of the mutation operator beside it. Note that the first arguments of all 
operators and the second argument of the deletion operator specify positions along the sequence, and not ancestries (specified at the top of the sites) 


uniquely defining each instantaneous state-to-state tran¬ 
sition as an action of a single mutation. (In the above ex¬ 
ample, the former causes the transition from v= [1,2] 
to y ' = [2], and the latter yields the transition from u 


toy ' = [1]). This, in conjunction with the operator rep- 

resentation of mutations, enables us to shift the focus 
from the trajectory of sequence states to the history of 
mutations, especially indels. This shift of focus, as well 
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Fig. 4 Example indel history and resulting alignments, a An example indel history in terms of the bra-vectors of sequence states and indel 
operators, b The graphical illustration of the history using basic sequence states. Each sequence state in panel a is horizontally aligned with its 
graphical representation in panel b. c The resulting MSA among the sequence states that the indel history went through, d The resulting PWA between 
the initial and final sequences. In both c and d, the bold italicized characters in the leftmost column are the suffixes indicating the sequence states in panel 
a. In panels b, c, and d, the number in each site (cell) represents its ancestry, but not necessarily its position along the sequence. The 'A' in the final 

sequence abbreviates 10. The same shading scheme as in Fig. 1 is used. The figure was adapted from Fig. 1 of [32] 
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as the “equivalence relations” between the products of 
operators (section R5), facilitates the examination of the 
factorability of alignment probabilities, as we will see in 
sections R4-R6. 


rate operator. Each of the three components can be fur¬ 
ther decomposed as: 

Q m (t) = QZ{t) + Q%(t) (m = S, I and D). (R3.2) 


R3. Instantaneous transition (or rate) operator and 
finite-time transition operator 

Now we are ready to define our evolutionary model, i.e., 
the general continuous-time Markov model to describe 
the evolution of an entire sequence along a time axis. If 
desired, this could be done by extending the rate gram¬ 
mar, Eqs. (Rl.1,2,3), so that each mutation rate depend 
on the entire extended sequence states (i.e., not only on 
the residue states) immediately before and after the mu¬ 
tation. (We will also introduce the explicit time depend¬ 
ence. This may allow us to incorporate the effects of 
changes in the physiology, genomic contexts, external 
environments, etc. (See also section 2.4 of [32].)) How¬ 
ever, to define the model more neatly, we will paramet¬ 
rize the mutation rates in coordination with the 
mutation operators defined in section R2. Although the 
parametrization is different from (the extended version 
of) Eqs. (Rl.1,2,3), the two sets of mutation rates are 
equivalent. (To remember these differences in the 
parametrization, we will use the symbol r m (...) (m = S,I 
andD) instead of p m (...).) Let s be the extended se¬ 
quence state immediately before the mutation. And 
let t be the time at which the mutation occurred. 
Then, r$(x, u>-+cj l ;s, t) is the rate of the substitution, 
Ms(x, . It must be zero unless cj is at the x-th 

site of s. rj(x, /, 8a / [/];s, t) is the rate of the inser¬ 
tion, Mj(x,l ) (accompanied by P^x, Sco r [l \^). And rp 

( XB,XE-s,t ) is the rate of the deletion, Md{xb,Xe) ■ 
Using these mutation rates that accompany the mu¬ 
tation operators, we can define our evolutionary 
model in a manner closer to the standard, by defin¬ 
ing the instantaneous transition rate operator (or the 
“rate operator” for short), which is an analog of the 
instantaneous transition rate matrix in a continuous¬ 
time Markov model. Because the state space we are 
working in, S 11 , is essentially infinite, we cannot give 
the explicit matrix expression of the rate operator 
on the entire state space. Nevertheless, the rate op¬ 
erator can be defined if we give its action on every 
state inS /; . Let Q SID (t) denote our rate operator (at 
time t). It is convenient to decompose it as follows: 

Q SID (t) = Q S (t ) + Q J (t) + Q°(t), (R3.1) 


where Q m (t) with m = S,IandD, respectively, are the 
substitution, insertion, and deletion components of the 


Here, the “mutation part”, Q_"j(T) , describes the 
transitions to different states via mutations of type 
m(=S,IorD). And the “exit rate part”, Q™(f), attenuates 
the state retention probability at the exit rate, R™(s, t), 
which is determined by the type m mutations. It guaran¬ 
tees that the state probabilities sum up to unity at any 
time. Specifically, the mutation parts are defined by the 
following actions on every state: 


(* lOwW-EaliE “' En ' 

CO #<O x (S) 

x(s| M s {x,u x (s)»J) 


r s (x,a) x (s)^co l -s 7 t) 


(R3.3) 


<*i QU^EiELE^n' 

x (i| Mj(x, l)F (x, 8a f {l]j 


rAx, l, 8a t 


(R3.4) 


r D (x B x E \s,t) 

x (s\M d (x b x e ) ■ 

(R3.5) 

Here, L(s) denotes the length (i.e., the number of sites) 
of s, and ai x (s) denotes the residue at the x-th site of s. Fig¬ 
ure 3 exemplifies the states on the right hand sides of Eqs. 
(R3.3-3.5). The terms with x = 0 and with x = L(s ) in Eq. 
(R3.4) represent insertions at the left and right ends, re¬ 
spectively, of the sequence. How to deal with such inser¬ 
tions varies depending on various factors including the 
model setting, and can be implemented by adjusting the 
insertion rates accordingly (see, e.g., [21, 26]). The terms 
with x B < 1 or Xe > L(s) in Eq. (R3.5) represent the dele¬ 
tions of the subsequences sticking out of the subject se¬ 
quence. These terms were included because the subject 
sequence is regarded as embedded in a “chromosome” 
with a virtually infinite length (e.g., [21, 26]). 9 The exit rate 
parts are defined in nearly the same form: 


<«(^E i(i) E + ” 

\ i \ j Z—/x fi =-ooZ— sx F =, 


’XE=max{l,XB} 


(s|Q™(t) =-R™(i, t)(sj (m = S,IandD ). (R3.6) 


They differ only in the exit rates: 

= E!=iE^n,. r s (x,(O x (s)»(o’-,s,t ), (R3.7) 

(O *(O x {s) 
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R‘ x (s, t) = E^ELE^'m^(*-*- S “ ’ 


(R3.8) 


R%(s,t) = E I(S> E + ti x r D (x B x E s,t). 

^ Z-^xb=~ oq < ■JXE=rnax{l,XB} v 


(R3.9) 


These equations, Eqs. (R3.1-R3.9), cooperatively define 
the instantaneous transition rate operator, Q SID (t), and 
thus define our evolutionary model. If desired, Q SID (t) 
could be decomposed as: 

Q SID (t) = Q s ™(t) + Q s x m (t). (R3.10) 

Here Q s ™(t)=Q s M (t) + Q ! M (t) + Q^(t) is the collec¬ 
tion of all mutational transition terms, and 
Q X IJ (t)=Q x (t) + Qx(t) + Qx(t) the entire exit rate 
part, which attenuates the state retention probability at 
the total exit rate, R% D (s, t)=R s x (s. t ) + Rx(s, t ) + R x {s, t ). 
Actually, the total mutational part, Qjtf’it), is equivalent 
to (the extended version of) the “instantaneous rate 
matrix” of the general SID model (defined by Eq. (1) in 
[21]), although the two expressions appear quite different 
from each other. The difference mainly stems from the 
parametrization and the state representation. We use our 
parametrization because we believe it to clarify the mean¬ 
ing of each term in the rate operator. And, in this section, 

the sequence state after the mutation (say, (s |) was 
represented as the result of the mutation operator 
(say, acting on the state before the mutation 

i 

(say, (i;|), like (s \ = (s\M m (...). This is legitimate be¬ 
cause a mutation transfers a subject state uniquely to 
another state. This state representation will facilitate 
the unfolding of our theory below. 

Thus far, we included substitutions (and residue states) 
mainly in order to discuss the relationship between our 
evolutionary model and the general SID model. How¬ 
ever, our main interest here is in calculating the 
probability of the skeleton of a sequence alignment, com¬ 
posed only of the sites and gaps, which will then be filled 
in with residues to form a full alignment. Because such a 
skeleton can be created only through an evolutionary 
process of indels, we will hereafter omit the description 
of substitutions and the resulting changes in the residue 
state (cj). (This includes omitting the “fill-in” operators, 
P(x, Su> '[/])’s.) (Henceforth, the alignment skeleton will 

be called the “alignment” for short.) But we will retain 
the basic sequence state consisting of ancestry indices 
(s=v). The rate heterogeneity will be realized only 
through the ancestry dependence. This would be al¬ 
most sufficient for representing the dependence of 


the rates on, e.g., the epigenetic or 3D structural con¬ 
text ( e.g ., [25]). And this may also be able to approxi¬ 
mate the dependence on some residue state contexts, 
especially in highly conserved regions. Now, the rate 
operator defined by Eqs. (R3.1-R3.9) is reduced as fol¬ 
lows. (The reduced total rate operator will be denoted 
as Q ID (t)). 

Q ID (t) = Q J (f) + Q D (t), (R3.ll) 

(*iqmU) = EiitE, /;s ’ *)(si^/(*. o, 

(R3.12) 

<s|QmW = EE~E«E*{ i' XB }MxB,XE;s,t)(s\M D (x B , XE ), 

(R3.13) 

t) = EEEL^ ^ s ’ *)’ (R3.14) 

^t) = E^LE*E- { Mn rD( ^’ X£;S)i) - 

(R3.15) 

Equations (R3.2,R3.6) remain unchanged except the 
exclusion of m = S and the replacement of s by s. In 
Eqs. (R3.12,R3.14), the insertion rates could be related 
to those in Eqs. (R3.4,R3.8) by the equation: 

n(x, /; s, t)=J2 s Jmn‘ ri ( X, ll ; s ’ > ( R3 -16) 

where the o -dependence of the right hand side was 
omitted. 

Now we can calculate the operator that gives the 
finite-time transition probabilities between states. We 
will call it the “finite-time transition operator”. Let P ID 
(t. t' \ be such an operator describing the state transition 
via indels alone from an initial time, t, to a final time, 
t' (> t). Operator P ID (t. t') is defined to give the fol¬ 
lowing equations: 

(s\P ID (t,t')\s) =p[(s',tl)\(s,t)\for v (s,s')e(S") 2 . 

(R3.17) 

On the right hand side, P[{s , t')\{s, t)] is the probability 
that the sequence is in state s' at time t' conditioned on 
that it was in state s at time t. On the left hand side, |s') 
denotes a ket-vector (an abstract extension of a column 
vector), whose exclusive role here is to give “inner- 
products” with bra-vectors: (s| s') = 1 (if s = s'), =0 
(otherwise). From the evolutionary principle that our 
model must satisfy, or equivalently, from the funda¬ 
mental properties (such as the Chapman-Kolmogorov 
equation) of the continuous-time Markov model, the 
finite-time transition operator must be given by the 


Ezawa BMC Bioinformatics (2016) 17:304 


Page 11 of 25 


multiplicative accumulation of the effects of the rate 
operators along the time axis. Thus, P ID (t,t') is for¬ 
mally calculated as: 



(R3.18) 


In the middle of the equation, / is the identity operator 
(i.e., (s|/=(s| for every s g S n ), and On 

the rightmost side, T{...} denotes the meta-operator that 
rearranges the operators in each operator product term 
in the temporal order so that the earliest operator comes 
leftmost. Another way to give P ID (t,t') is through the 
first-order time-differential equation. Again, from the 
fundamental properties of the continuous-time Markov 
model, or equivalently, from Eq. (R3.18), we can show 
that the operator satisfies the following differential 
equations: 

^P ID M =P ID {t 1 t')Q ID (t'), (R3.19) 

^-P ID (t, 0 = -Q ID (t)P ,D (t, t'). (R3.20) 

Equation (R3.19) is the “forward equation”, and Eq. 
(R3.20) is the “backward equation”. And the evolutionary 
principle naturally includes the following equation: 

P ID (t,t) =1 for ^ts[t h t F l (R3.21) 

where [t It t F ] is the time interval in which the model 
is defined. This equation could be used as the initial 
condition for each of Eqs. (R3.19,R3.20). In the next 
section, we will obtain the solutions for Eqs. (R3.19, 
R3.20, R3.21) in a more tractable form than the defin¬ 
ing solution, Eq. (R3.18). 

R4. Perturbation expansion of finite-time transition operator 
and pairwise alignment probability: brief description 

In time-dependent perturbation theory of quantum me¬ 
chanics ( e.g ., [29, 30]), the instantaneous time evolution 
operator (Hamiltonian H(tj) is considered as a sum of 
two operators, H{t) = Ho{t) + V’(t), and the time evo¬ 
lution of the system is described as if the system mostly 
evolves according to the well-solvable instantaneous 
time-evolution operator ( H 0 {t )) and is occasionally 
perturbed by the “interaction” operator ( V ( t )). We 
adapt such a technique of time-dependent perturb¬ 
ation expansion to our evolutionary model. Here, we 
briefly describe the results. For their detailed 


derivations, see Supplementary methods SM-1 in 
Additional file 1. We first re-express our rate oper¬ 
ator as: 

Q®(f) = Q®(t) + Q®(f). (R4.1) 

Here Q^(t)=Q x (t) + Q x (t) describes the mutation- 
free evolution, and Q®(t) = QmW + Q^(£) describes the 
single-mutation transition between states. From the re¬ 
duced form of Eq. (R3.6), we get: 

(s|Q®(f)=-^M(s|, (R4.2) 

with R I x (s,t)=R I x (s } t) + R%(s,t). (R4.3) 

Using Eq. (R4.1), the forward equation (Eq. (R3.19)) 
accompanied by the initial condition (Eq. (R3.21)) 
can be shown to be equivalent to a crucial integral 
equation: 

P ID (t, t') = Po D (f, t') + £ dTP ID (t , r)Q®(r)Po D (r, t'). 

(R4.4) 

. Similarly, 

the backward equation (Eq. (R3.20)) accompanied by 
Eq. (R3.21) is equivalent to another crucial integral 
equation: 

P ID (t, t ') = If(t, t') + J‘ dTpf(t, t)Q™(t)P id (t, t') . 

(R4.5) 

Now, to formally solve Eq. (R4.4), we assume that the so¬ 
lution can be expanded as: P ID (t , t') = ^ n _ q P^n) {p 0 > 
where P®,^ (f, t 1 ) is the collection of terms containing N 
indel operators each. Substituting this expansion into 
Eq. (R4.4) and performing some formal calculations, 
we get the final form of the ah initio solution we 
desire: 

(so\p ID (ti,t F ) = X! 

N=0 [Afj, M2,- -,M n ]gH ,d (N-So) 

P[( [Mi,M 2 ,...,Mm) ,[ti, tf]) |(so, tj)\(so\M iM 2 -■ -Mx. 

(R4.6) 

Here, H ID (N-,so) denotes the space of all possible his¬ 
tories of N indels each beginning with the sequence 
state, s 0 - And 


Here, P^it', t")=T< expl / drQ ®(r) 
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P[M 1 ,M 2 , Afjv], [ti, IfDK-So, ti))\ 


I -J dr r --rfr w ^]^[^ i r( J W v ;5 v _i,rv)j 


ti=Ta<T\<-• -<Tn <Tjsi+i=tf 


exp< - 


^ rn+i ) 

~Yj dTB${s v ,T)\ j| (; 


!s»I=(s^i|A v |v= 1 ,...,AT} 

(R4.7) 


is the probability that an N -event indel history, [Mi, 
M 2 , •••, M n ] (with M v (v= 1,2, ...,N) being the v-th 
event), occurred during time interval [tj, tf , given an 
initial sequence state (s 0 ) at time f 7 . The rate, r(M v ; 
s v _i,r v ), is rfx, I; s v _ i, t v ) if M v = M I (x,l), and it is 
r D (x B ,XEiS v _i,T v ) if M v = M d (xb,Xe). It should be 
noted that H ,D (N = 0; s 0 )={(s 0 , [])} consists only of the his¬ 
tory with zero indel, [], whose conditional probability is: 


([],[£/, tf])|(so, £/)] = ex p{~^ drRfiso, r) 


Eq. (R4.6) 


supplemented with Eq. (R4.7) is also the solution of 
Eq. (R4.5). (Mathematically, Eq. (R4.7) is a multiple¬ 
time integral over all possible timing, whose integrand 
is the probability density of an evolutionary process 
of N indels with particular timing, (zq, r 2 ,.... t n )). 

Equation (R4.6) states that the finite-time transition oper¬ 
ator (acting on (s 0 |) is the collection of the effects of all pos¬ 
sible indel histories starting with s 0 , each weighted by its 
probability (Eq. (R4.7)). Thus, it mathematically underpins 
Gillespies [34] famous stochastic simulation algorithm, 
which provides the basis of genuine molecular evolution 
simulators ( e.g., [26-28]). Our derivation of Eq. (R4.6) and 
Eq. (R4.7) through the integral equation (Eq. (R4.4) or Eq. 
(R4.5)) bridges Gillespie’s own intuitive reasoning and 
Feller s [35] mathematically rigorous proof of the solution. 

Now, substitute an “ancestral” sequence state, 
s A (€S' // ), for s 0 in Eq. (R4.6), and take the inner prod¬ 
uct between it and the ket-vector, Is 0 ), of a “descend¬ 
ant” sequence state, s D (eS H ). Comparing the two 
sequence states in S n naturally gives a PWA, a(s A , s D ) 
(e.g., Eq. (R2.1)). 10 Hence, the summation of 
(s A |P / °(t / ,f f )|s°) ’s over all “equivalent” s D ’s providing the 
same a(ff,s D ) must be P[(a(g A ,s D ),[t I ,t F ])\(g A ,tfi\, which 
is the probability that a (s' 4 , s°) results from sequence evo¬ 
lution during [tj, t/ \, given s A at tj. Similarly to the deriv¬ 
ation of Eq. (R4.6), we obtain its formal expression as: 


Here, H ID [N; a(s A ,s°)] denotes the set of all indel his¬ 
tories with N indels each that can result in a(s A , s D ), and 
IV mi „[a(s A ,s°)] is the minimum number of indels re¬ 
quired for creating the PWA. 

Using the set of all PWA-consistent histories, 
H ffl [«(s A ,s°)]^U^ i)i[a(j , isD)] H /0 [iV;a(s A ,s B )], Eq. (R4.8) 
can be further simplified as: 

R[(a(s A ,s D ),[f / ,f J r])|(s A ,f / )] = Y 

[Mi, Mjv] 

cH m [a{! r 4 , s' 3 )] 

P[([M 1 ,M 2 ,---,M N ],[t I ,t F })\{s A ,t I )}. 

(R4.9) 

Equation (R4.8) and Eq. (R4.9) are the formal expres¬ 
sions of the occurrence probability of a(s A , s D ) derived in 
effect from the defining equations, Eqs. (R3.19, R3.20, 
R3.21), of our evolutionary model. Thus, they are the 
“ab initio probability” of the PWA. In the following, we 
will examine its factorability. 

R5. Local history-set equivalence class of indel histories 

Before advancing to the factorability of general PWA 
probabilities, we will introduce an essential concept 
here. For this purpose, we first consider the very sim¬ 
ple PWA, Eq. (R2.1), as an example. (Here we make 
the substitutions, s' 4 = ~u and s° = ~u ). In this case, 

I F 

NminM^, s°)\ - 2, and there are two 2-indel histories 
that can yield this PWA: one is [Md(2, 4), Af 7 (3,3)] 
(Fig. 5a) and the other is [Af 7 (6,3 ),Md( 2,4)] (Fig. 5b). 
Thus, these two indel histories result in the same final 
state: (s F \ = (s 7 |M D (2,4)M 7 (3,3) = <s 7 |AT 7 (6, 3)Af D (2,4) 
(Fig. 5, panels a and b). In other words, the two different 
successive actions of two indel operators have the same ef¬ 
fect on the sequence states (in space S n ). This fact will be 
phrased as “the two products of the operators are equiva¬ 
lent”, and represented by the relationship: 


M 7 (6,3)M D (2,4)~M 7) (2,4)Af 7 (3,3). (R5.1) 


This “binary equivalence” can be generalized to the 
following relationships between two indel events sepa¬ 
rated at least by a PAS: 


r oo 

p[(a(s A ,s D ),{t l ,t F })\{s A ,t I ))} = Y 

N= 

[«(** ,s D )] eH® ,s°)] 

p[([M l ,M 2 ,---,M N ],[t I ,t F ])\(/,t I )\. 

(R4.8) 


Mi(xi,h)Mi(x 2 ,h) ~ Mi(x 2 ,h)Mi(xi + h,h) for xi > x 2 , 

(R5.2a) 


Md(xb,xe)Mi(x, l) ~ Mi(x, 1)Md(xb + 1,xe +1) for xb > x+ 1, 

(R5.2b) 
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a [M fl (2,4),M,(3,3)] 


b [m ; (6, 3), M d (2, 4)] 


<*|(-<* A |) 


<*.l 

= <s A |M D (2,4) 





= {s a \m d (2,4)M,(3, 3) 




= (s A \M,(6, 3) 
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= (s A |A7,(6, 3)M„(2,4) 


c LHS, {[M D (2,4)], [m 7 (6, 3)]} 


& 


Fig. 5 Binary equivalence relation and LHS equivalence class, a An indel history, [m d ( 2, 4), M,(3, 3)]. b Another inde! history, [M,(6,3), m d (2,4)]. 
These histories result in the same final state ((s F | (= (s D |)). Thus, their total effects are equivalent, c Their equivalent local history set (LHS), 
{[M d ( 2,4)], [vVf/(6,3)]}, is represented by the isolated actions of local indel histories on the initial state (each enclosed in a 

dashed box) 




MfxJ)M D (x B ,x E ) ~ M D (x B ,x E )M I (x-f ,lj for x > x E , 

(R5.2c) 

Md(xbi, Xei)Md(xb2i Xei) ~ Md{xb2,Xei)Md[x B \-12i x EI-12^ 
for XBI > XE2 + 1 . 

(R5.2d) 

Here, l'=min{x E -x B + l, x E } in Eq. (R5.2c), and 
l' 2 =min{x E 2 -x B2 + l,x E2 } in Eq. (R5.2d). If you will, these 
equivalence relations could be phrased as follows. “The 
operator representing the event on the left along the se¬ 
quence will not change whether it comes first or second. 
The operator representing the event on the right will 
shift its operational position to the left/right by the num¬ 
ber of sites deleted/inserted before its operation, when it 
comes second”. 11 

Now, we will extend the binary equivalence relations, 
Eqs. (R5.2a-R5.2d), to the equivalence relations among 
more general complex indel histories, each consisting of 
more than two indel events. Let us consider a global his¬ 
tory of N indel events, [Mi,M 2 , ...,M N \, which begins 


with an “ancestral state”, .s ,/ '(e,S' ,/ ), and ends with a “des¬ 
cendant state”, ^(eS 11 ). Obviously, the two states must 
satisfy the equation: 

(s°\ = (R5.3) 

Given an indel history, we can identify PASs unambigu¬ 
ously. Suppose that such PASs separate the indel events, 
M v (v = 1,2,..., Af) in [Mi,M 2 , into K local sub¬ 

sets of indels, each of which is confined either between a 
pair of PASs or between a PAS and an end of the resulting 
PWA. Number the K local subsets as k = 1,2, ...,K from 
left to right, and let A4 be the number of indel events in 
the k th local subset. Naturally, £f = \N k = N. And let 
Mfc^ik] be the element of {M v } v=19 n representing 
the 4 th event (in the temporal order) in the lc th 
local subset (i k = 1, 2,..., N/,; k= 1, 2,.... K). (The prime 
here indicates that the operator is equivalent to the 
prime-less version). Then, repeatedly applying the bin¬ 
ary equivalence relations, Eqs. (R5.2a-R5.2d), between 
the indel operators belonging to different local 
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subsets, we can move the operators around in the 
product in Eq. (R5.3) and get the following equation: 

(s°| = <^| [M[K, 1}---M[K,N k ]]---[M[1,1}--M[K,N 1 }]. 

(R5.4) 

Here M[k, i k ] is an operator that was obtained from 
M'[k, 4] through the series of equivalence relations Eqs. 
(R5.2a-R5.2d) that brought Eq. (R5.3) into Eq. (R5.4). As 
in Eq. (R5.3), the operators in each pair of large square 
parentheses in Eq. (R5.4) are arranged in temporal order, 
so that the earliest event in each local subset will come 
leftmost. But it should be noted that the order among 
the pairs of large square parentheses is the opposite of 
the actual spatial order among the local subsets, so that 
the rightmost one along the sequence (the K th one 
here) will come leftmost. In this way, the operators in 
each local subset, e.g., {M[k, 1], ...,M[k,N k \}, are exactly 
the same as those when the events in the subset alone 
struck (,s A |. Thus the series of operators, 
[M[k, 1], ...,M[k,N k ]], for the k th local subset defines 
the k th local indel history that was isolated from the 
global indel history, [Mi,M 2 , Mn\, on S/eS. 

Now, let us consider a history of N indel events other 
than [Mi,M. 2 ,...,Mn\ - If the temporal operator product 
of the history is shown to be equivalent to Eq. (R5.4) 
through a series of Eqs. (R5.2a-R5.2d), then it should 
also be connected to Eq. (R5.3) though another series 
of Eqs. (R5.2a-R5.2d). Therefore, it should be equivalent 
to [Mi,M 2 , in this sense. Hence, we can define a 

particular equivalence class, which is the set of all global 
indel histories that can be “decomposed” into the identical 
set of local indel histories, such as Eq. (R5.4), only 
through a series of Eqs. (R5.2a-R5.2d), between indel 
operators separated by at least a PAS. We will call it 
the “local-history-set (LHS) equivalence class”. In 
the equivalence class defined by a local history set 
(LHS), {[M[k,l},...,M[k,N k }]} k=1 K (with Ef = 1 A4 = 
N), on an initial sequence state s A e S H , there are 
—P LHS-equivalent global indel histories begin- 

IL.A* 

ning with s' 4 . Each of the global histories corresponds 
to a way of reordering N indel events while retaining 
the relative temporal order among N k events within 
the k th local indel history (for every k= 1 , ...,/<). 

In the simplest example at the beginning of this sec¬ 
tion (Eq. (R5.1) and above), the corresponding LHS is: 
{ [M d (2, 4)], [Mj(6, 3)] }. The LHS consists of two local 
histories, each of which is a single-indel history (Fig. 5c). 
As a slightly more complex example, consider the history, 
[Md(3, 3),M/(5, 2),Md(2, 3),Mj(5, 1)], illustrated in 
Fig. 4. This history belongs to the LHS equivalence 


class represented by the LHS: { [M b (3, 3),Md( 2, 3)] , 
[Af/( 6 , 2 ),M/( 8 , 1 )]}, which consists of two 2 -indel local 
histories. If this LHS is recast into the form in Eq. (R5.4), 
we have: M[l, 1] = M D (3, 3), M[ 1,2] = M D ( 2, 3), Af[2,1] 
= M/( 6 ,2), and M [2,2] = Af/( 8 ,1). 

R6. Factorability of pairwise alignment probability: brief 
description 

Now we are ready to examine the factorability of the 
ab initio probability of PWA a(s A , s°), P^a^, s D ), [t b 
tf-])\{s A , t[)\ in Eq. (R4.9), given the ancestral state (s A ) 
at the initial time ( t 7 ). Here the “factorability” means 
that the PWA probability can be re-expressed as the 
product of an overall factor and contributions from 
local regions. Natural candidates for the local regions 
would be those in between the PASs, because we 
know that indels never hit or pierced PASs (if the 
alignment is correct). We are not interested in trivial 
factorability. Thus, we only consider PWAs (or global 
histories) each of which requires at least two local 
indel histories. In the following, we will only briefly 
describe our demonstration of the PWA probability 
factorization. Its more detailed yet rather intuitive de¬ 
scription is given in Supplementary methods SM-2 in 
Additional file 1. (It is complemented by mathematic¬ 
ally rigorous arguments in Supplementary appendix 
SA-2 in Additional file 2). 

We first notice that each component probability, 
P[([M 1 ,M 2 ,---,M N \,[t I ,t F ])\(s A ,t I )] given by Eq. (R4.7), 
will not be factorable. This is because its domain of 
multiple-time integration is not a direct product. So, 
lets consider the total probability of a LHS equivalence 

class, [aJihs (with M abbreviating { [M[k, 1], M 
[k, AIjr]]}>=i,...,sr) : 

/>[([Af 1 ,Af 2 ,-,Ar Ar ],[t / ,fr]) |(sV/)]- 

(R6.1) 

We can show that this probability can be factorized as: 

K 

= \\_dp[{[M[k, l],...,Af[Ar,Af*]], [t I ,t F ])\(s A ,t I )], 
k=1 

(R6.2) 

where 
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l*p\([M[k,l),...,M[k,N k ]\, [t u t F ]) (A,£ 7 ) 


with a(A, s°). Hence, the PWA probability, Eq. (R4.9), 
can be rewritten as: 


=P[([M[k, 1],...,M [£, AT*-]], 

[t, , t F \ ) | (A, h) ] /P [( [], [ti, t F }) | (A, £,) ], 

(R6.3) 

M(feLs’ fr> f *])|(A&)] 

= p [( LHS , A tf]) | (s' 1 , £/)] /R[(D, A A) I (A, */)], 

(R6.4) 


if the following two conditions are satisfied. 

Condition (i): The rate of an indel event 
( r(M v ;s v _i, t v ) ) is independent of the portion of the 
sequence state (s v _ i) outside of the region of the local 
history the event (M v ) belongs to. 

Condition (ii): The increment of the exit rate due to 
an indel event {SR l x{s v , s v - 1 , t)=R'x(s v , t) - R'x(s v _ 1( r), 
with (s y | = (s v _i|Af v ) is independent of the portion of 
the sequence state (s v _ i) outside of the region of the 
local history the event (M v ) belongs to. 

See Supplementary appendix SA-2 in Additional file 2 
for the derivation of the mathematically rigorous version 
of this set of conditions. (For illustration, in Supplemen¬ 
tary methods SM-3 in Additional file 1, the factorability 
of the probability was examined for the simplest con¬ 
crete LHS equivalence class (in Fig. 5)). Condition (i) is 
somewhat similar to the “context-independence” condi¬ 
tion imposed on the “long indel” model [ 21 ], though our 
condition is slightly less restrictive. Condition (ii) has 
never been found or even discussed thus far. In fact, the 
“long indel” model trivially satisfies this condition (see 
subsection R8-1), thus [21] did not need to pay attention 
to it. However, this condition is not always satisfied. In¬ 
deed, as exemplified in subsection R8-2, some models 
have non-factorable alignment probabilities due to the 
violation of this condition even though condition (i) is 
satisfied. 

Each global indel history (in the set of all PWA- 
consistent indel histories, H /B [a:(A,A)] ) belongs to a 


single LHS equivalence class (say, 



And all ele¬ 


ments of 


M 


can result in the same PWA. Therefore, 


LHS 


we get the direct sum structure: 


H /Z) [a(s A ,s D )] 


u 


MeA®[ii(s < , 



(R6.5) 


where A /£ ’[a(s A ,s D )] is the set of all LHSs consistent 


/ , [(a(A,A),[£/,f f ])|(A,£ / )] 

% £ 'IALA’AaA 


(R 6 . 6 ) 


We are considering all indel histories, including non- 
parsimonious ones, that can yield a(A, s°). Thus, the 
LHSs belonging to A ID [c^A,/ 5 )] may consist of differ¬ 
ent numbers of local histories. We will choose the max¬ 
imum possible set of PASs in the given PWA, which 
separates the PWA into the finest potentially local- 
history-accommodating regions, denoted as y 1 ,y 2 ,..., 
Y Kmax ■ ( K m ax is uniquely determined by the PWA and 
the evolutionary model). 12 Then, we can represent 

any Af = { 1],..,M[f,W t ]] } t=1 _ [a(/,/)] 

as a vector with K max components: 

M= AL[y 2 ],...,Af [x^J). (See Figure SI in 

Additional file 1). Substituting Eqs. (R6.2,R6.4) into 
Eq. (R6.6), and exploiting the vector representation of 
the LHSs, we can reach the desired expression: 

p[(a(A,A),[£/,£ F ])|(A,£/)] 

= ;»[(□, MpDI (A*/)] 

X [(A /D [y K ; a(A, A)], [t u t F \ ) | (A , £/)]. 

K—l 

(R6.7) 

Here, A ID \y K \a(s A , A)] denotes the set of local indel 
histories that can give rise to the sub-PWA of ct(A, s°) 
confined in y K , and the multiplication factor, 


i*P [ A® [>v; a(A A], [h Tf] ) I (A f/)] 

= J2 aa)| (At/)], 

M [/*.] eA® ^;a(s^ ,s D )J 

(R 6 . 8 ) 

represents the total contribution from A /£ '[y K ;a(s A , A)] 
to the PWA probability. (Here £</■[([],[£/, A])|(A, f/)] = l 
should be remembered). 

Equation (R6.7) states that the PWA probability is fac¬ 
torized into the product of an overall factor 
(P[([], [£/, £f])|(A, £/)]) and contributions from regions ac¬ 
commodating local indel histories {ft p [[A ID [y K ; a(A, A)], 
[£/, t F \) | (A, i/)] ’s). Therefore, the set of conditions, (i) and 
(ii), is sufficient for the factorability of the PWA 
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probability. At present, we are not sure whether the 
set of conditions is also necessary or not. This may 
not be the case in the rigorous sense, and there may 
be some instances with factorable PWA probabilities 
despite the violation of condition (i) or (ii). Neverthe¬ 
less, even if there are, we suspect that such cases 
should be isolated, requiring intricate cancellations of 
the terms. Thus, we will refer to the conditions (i) 
and (ii) as the “sufficient and nearly necessary set of 
conditions” for factorable PWA probabilities. 


R7. Factorability of multiple sequence alignment 
probability: brief description 

Thus far, we only examined the probability of a given 
PWA, conditioned on an ancestral state at initial time. 
Actually, once we know how to calculate such condi¬ 
tional PWA probabilities, we can build them up along 
the phylogenetic tree to calculate the probability of a 
given MSA, as described in the introductions of [13] and 
[14]. (See also [36] for an essentially equivalent method 
that appears different.) Here, we basically follow their 
procedures. However, it should be stressed that the MSA 
probability here will be calculated ab initio under a genu¬ 
ine evolutionary stochastic model and not under a HMM 
or a transducer, which is not necessarily evolutionarily 
consistent. This section briefly explains the derivation of 
the factorization of an ab initio MSA probability. For de¬ 
tails on the derivation, see Supplementary methods SM-4 
in Additional file 1. 

In this section, we formally calculate the ab initio 
probability of a MSA given a rooted phylogenetic tree, 
T = ({«} 7 i {b}j), where {n} T is the set of all nodes of the 
tree, and {b} T is the set of all branches of the tree. We de¬ 
compose the set of all nodes as: {n} T = N in (T) +N x (T), 
where N 1n (T) is the set of all internal nodes and N x 
(T) = {«i,...,, n N x} is the set of all external nodes. 
(The N X =\N X {T)\ is the number of external nodes.) 
The root node plays an important role and will be 
denoted as n Root . Because the tree is rooted, each branch 
b is directed. Thus, let n A (b) denote the “ancestral 
node” on the upstream end of b, and let n D (b) denote 
the “descendant node” on the downstream end of b. 
Let s(n)eS u be a sequence state at the node ne{n} T . 
Especially, we use abbreviations: s A (b)=s(n A (b)) e S" and 
s D {b)=s(n D [b)) eS n . Finally, as mentioned in Background, 
we suppose that the branch lengths, {\b\ |b e {b} T }, and the 
indel model parameters, {@aAb)}js{0 II ^b)\b 6 {b} T }, are 
all given. Note that the model parameters &io{b) could de¬ 
pend on the branch, at least theoretically. 

First, we extend the ideas proposed in [13, 14, 36] 
to each indel history along a tree, by regarding the 
indel history along a branch as a map (or a trans¬ 
formation) from the ancestral state to the descendant 


state, as follows. An indel history along a tree con¬ 
sists of indel histories along all branches of the tree 
that are interdependent, in the sense that the indel 
process of a branch b determines a sequence state 
s D (b ) at its descendant node n D (b), on which the indel 
processes along its downstream branches depend. 
Thus, an indel history on a given root sequence state 
s Root = s{n Root ) s S n automatically determines the se¬ 
quence states at all nodes, {s(w) e S 11 for v « € {n} T }. 
Let H /D (so)sU^ =0 H /£) (A/'; So) (with H ID (N;s 0 ) defined 
below Eq. (R4.6)) be the set of all indel histories along 
a time axis (or a branch) starting with state s 0 . Then, each 

indel history, |m(&)| , along tree T and starting with 
s Root can be specifically expressed as: 

ir(i.HM 1 (i),4 #) (S)]£H' D (/(J))«»1 \s(n Moo ‘) =s Ro °‘ 
(/>(i)|=(s' 1 (A)|M 1 (i)--M A , w (l.) for v b£{b} T \ 

(R7.1) 

Here, the symbol, M v (b), denotes the v th event in 
the indel history along branch bs{b} T . The probabil¬ 
ity of the indel history, Eq. (R7.1), can be calculated 
as: 


{M (b)} T \(f° ot ,n Roc 

= ( II p \{M{b),b) !(/(&),tf 4 ^)) 


fie{b} T 


£ ^ yjRoot —^Root 

y>(i)|=(«*(i)[ At, (*)-*„(», (*) 
for *be{b} T 


(R7.2) 


Here, the probability of an indel history, M{b) = 
...,M N ^(b)]eH ID ^(b)), along branch bs{b} T 
is given by the probability during the corresponding time 
interval, [t{n A (b)), t(n D (b))\: 


P[(M (b),b)\{s A (b),n A (b))} 

^^[([Ar!^), —.iCr^*)^)], [t(w A (*)),t(w D (*))])|(* A (*), 
t{n A {b)))]\ @ID (b). 

(R7.3) 

Here we explicitly showed the branch-dependence of 
the model parameters. 

Now, consider a MSA, a[si,S 2 , among the se¬ 

quences at the external nodes, s t = s(«,-) e S u («,■ e N X (T)). 
(Remember that the term “MSA” here means its hom¬ 
ology structure, as noted in endnote (10)). Let 

(s Root , {M(h)} r ) be a pair of a root state and an indel 

history along T starting with the state. And let 'T ,,) 
[«[si,S 2 , T] t> e the set of all such pairs defined on 

T consistent with a[si,S 2 , ••.,%*] • Then, analogously to 
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Eq. (R4.9) supplemented with Eq. (R4.7) for a PWA, the 
probability of a given MSA under a given model setting 
(including T) should be expressed as: 


i>[a[si,S2,-,Vr]|r] = 

(s Roof , 


[«[*!,s 2 , T 

>[(/ 00t , n Root )]P ^M(b )} (s Root : n 


..Root s 


(R7.4) 


which is supplemented with Eq. (R7.2). Here, 
P[(/ ooi , n Root )] is the probability of state ^ oot at the root 
node. (It may be interpreted as the prior in a Bayesian for¬ 
malism.) If you will, Eqs. (R7.4) and (R7.2) could be con¬ 
sidered as the “perturbation expansion” of an ab initio 
MSA probability. To make this formal expansion more 
tractable, let {s{n)} N m={s{n)eS\neN IN {T)} denote a set 
of ancestral states at all internal nodes (, or, more pre¬ 
cisely, its equivalence class in the sense of endnote (8)). 
To be consistent with a given MSA, the ancestral states 
must satisfy the “phylogenetic correctness” condition in 
each MSA column {e.g., [37, 38]). 13 Let Z[a[si,S 2 , 
{neN m (T)}; T] be the set of all {s(«)} iv m’s consistent 
with a[si,S 2 , •••,%*] (and tree T). Then, Eq. (R7.4) supple¬ 
mented with Eq. (R7.2) can be rewritten as: 

p[a[si,s 2 ,...,%*]|r] = Y 

{s(n)} N m 

eZ[a[ Sl ,s 2 , -,%*]; {«eN /w (T)}; T] 
p[«[si,s 2 , {«(«)}^™|t] . 

(R7.5) 

Here, / > [a[si,S 2 , {s(«)} JV w|T'] is the probability 

of simultaneously getting a[si,s 2 ,...,Sj V Jc] and {s(n)} Af cv. 
This probability is the sum of contributions from all 
indel histories sharing the same homology structure 
among sequence states at all nodes. Especially, the se¬ 
quence states at internal nodes have homology struc¬ 
tures (with the states at other nodes) fixed for respective 
nodes. Thus, through some reasoning (explained in SM- 
4), we get: 

.p[a[si,s 2 , ...,%*]; {s(B)} N w|r] 

= p\(s r °°‘ 1 n Root )\ P |"(a(s A (i>), s°(b)), b) I {^{b), . 

(R7.6) 


Here, 


P [ (a (s ' 4 ( b ), s° (b )), b) | (V 4 {b) , ( b )) j 

=P [(a^O), s°{b )), ^(r^ib )), t(vP(b ))j) | (^(b ), 

(R7.7) 

is the probability of the ancestor-descendant PWA along 
branch b. This Eq. (R7.6) is basically the expression pro¬ 
posed in [13, 14], and we demonstrated in effect that 
their proposal also holds even with a genuine stochastic 
evolutionary model. Usually, Eq. (R7.5) supplemented 
with Eq. (R7.6) is much more tractable than Eq. (R7.4) 
supplemented with Eq. (R7.2), because of the two rea¬ 
sons. (1) Usually, it is not the indel history along the tree 
but (the homology structure of) the set of ancestral 
sequence states that is inferred from a given MSA. 
(2) The probability of each indel history along the 
tree (Eq. (R7.2)) is not factorable in general, whereas 
Eq. (R7.6) is a product of PWA probabilities, each of 
which should be factorable if the conditions (i) and 
(ii) in section R6 are satisfied. 

Now, we can show that, if the “condition (iii)” given 
below in addition to conditions (i) and (ii) is satisfied, 
we can factorize the MSA probability into a form some¬ 
what similar to Eq. (R6.7) for the PWA probability. In 
subsection 4.2 of [32], we demonstrated it using the 
history-based expansion of the MSA probability (i.e., Eq. 
(R7.4) supplemented with Eq. (R7.2)). In Supplementary 
methods SM-4, we will use the ancestral-state-based ex¬ 
pansion {i.e., Eq. (R7.5) supplemented with Eq. (R7.6)), 
as was only briefly sketched at the bottom of subsection 
4.2 of (ibid.). In a MSA, gapless columns play almost the 
same role as PASs in a PWA. Because of the aforemen¬ 
tioned “phylogenetic correctness” condition, a gapless 
column indicates that no indel event hit or pierced the 
site. Therefore, gapless columns will partition a MSA 
into regions each of which accommodates a local subset 
of every global history. Analogously to the argument 
above Eq. (R6.7), let Ci, C 2 ,..., Cx max be the maximum 
possible set of such regions determined by a given MSA 
and a model setting (including the tree) (Figure S2 in 
Additional file 1). (As argued in subsection R8-3, all gap¬ 
less columns are not necessarily needed to delimit the 
regions.) Because the summation in Eq. (R7.5) involves 
the summation over all MSA-consistent root states, it 
would be convenient to specify a “reference” root state, 
So° ot . It can be anything, as long as it is the state at the 
root consistent with a[si,s 2 , •••,•%*] • Technically, one 
good candidate for So° ot would be a root state obtained 
by applying the Dollo parsimony principle [39] to each 
column of the MSA, because it is arguably the most 
readily available state that satisfies the phylogenetic 



Ezawa BMC Bioinformatics (2016) 17:304 


Page 18 of 25 


correctness condition along the entire MSA. Then, we 
will impose the following condition. 

Condition (iii): 

r>\( Root ..Root\ 1 t\\( Root ..Root \ 1 I I \JZoot Root ..Root./-- i 

/'[(s' ,n )J=/'[(s^ ,n )]_[]_ ,n ;C K \. 

K—l 

(R7.8) 

Here the multiplication factor, pp[s Root ,s$ oot ,n Root -,C K ], 
represents the change in the state probability at the root 
due to the difference between s Root and s Root within C K . 
This equation holds, e.g., when P[(s Root , n Root )\ is a geo¬ 
metric distribution or a uniform distribution of the root 
sequence length. 14 

Under the conditions (i), (ii) and (iii), through a series 
of formal calculations and reasoning, Eq. (R7.5) supple¬ 
mented with Eq. (R7.6) can be re-expressed into the final 
factorized form: 

P[a[si,s 2 , •••,%*] | T] 

= P 0 [4 oot \T]Y[M p [a[s 1 ,s 2 ,...,s NX };4 oot ;C K \T}. 

K—l 

(R7.9) 

Here, 

P 0 [C t |r]^[(4 00t , K feo 0M{[]}r| (C*, n Root )} 

(R7.10) 

is the probability of having state s Root that has been in¬ 
tact all across tree T, andMp[a[si, s 2 , Ck\T] 

is the multiplication factor contributed from all local 
indel histories (along T) confined in C ’ K . 15 Briefly, the 
multiplication factor is the summation of terms over all 
possible sets of the MSA-consistent ancestral states in 
C K . And each of the terms is the product of the local 
PWA multiplication factors (Eq. R6.8) confined in C K 
(Figure S2 in Additional file 1), the exponential of minus 
the summation over all b’s of the time-integrated exit 
rate differences between s A (b) and s Root coming from C K , 
and p P [s Root , s Root , n Root ; C K \ for the root state probability. 
(For the factor’s exact expression and the detailed deriv¬ 
ation of Eq. (R7.9), see Supplementary methods SM-4 in 
Additional file 1). 

R8. Examples: Models with factorable/non-factorable 
alignment probabilities 

A merit of conditions (i) and (ii) given in section R6 is 
that they can draw the line between evolutionary models 
with factorable PWA probabilities and those with non- 
factorable ones. To illustrate the use of these conditions, 
we here give three examples: (1) a simple model with 
factorable probabilities, (2) a simple model with non- 


factorable probabilities, and (3) a non-trivial model 
with factorable probabilities. (For more examples, see 
section 5 of [32]). 

R8-1. Totally space-homogeneous model 

The simplest conceivable indel models would be those 
whose indel rate parameters are space-homogeneous, 
i.e., independent of the positions where the indels 
occur: 

r I {x,l 1 ;s,t)=g I {l 1 ,t), (R8-1.1) 

rD(x B ,x B + l 2 -l;s,t) =g D {l 2 ,t). (R8 - 1.2) 

In fully space-homogeneous models, these equations 
hold for 1 < x < L(s) - 1, 1 < /i < Lf°, 1 < l 2 < i£°, and 
2 - l 2 < x B < L(s), where L c f° and L%° are the “cut-off 
lengths” for insertions and deletions, respectively. (Depend¬ 
ing on the model, r/0, l; s, t) = gij.(l, t) and rj{L{s), l; s, t) - 
gijt{l, t ) could differ from g/(l, t) in Eq. (R8-1.1)). In fact, 
these conditions were imposed in nearly all continuous¬ 
time Markov models of indels that were studied in the 
past. Note that the rate parameters in Eqs. (R8-1.1.R8-1.2) 
could depend on time, although most studies thus far as¬ 
sumed that the rates are time-independent as well. Eqs. 
(R8-1.1,R8-1.2) automatically guarantees condition (i). 
Thus, all we have to do is check whether or not condition 
(ii) is also satisfied. Indeed, we can show it is. The exit rate 
of this model is calculated by substituting Eqs. (R8-1.1.R8- 
1.2) into Eq. (R4.3) (supplemented with Eqs. 
(R3.14,R3.15)), and we find that it is an affine function of 
the sequence length (L(s)): 

Rf (s, t) = A(t)L(s) + B(t ), (R8 - 1.3) 

with A(t) = f ) + t) and B{t) = 

E/f i ('-w. t) +Ei^ M', t) + &A 1 ’ *)) • If 

the exit rate is affine, we have, for (s(v)| = (s(v-l)\M v : 

8R™ (5(v),s(v-l), f) = R^{s{v),t)-Rf (s(v-l), t) 

= A(t)[L(s(v))-L(s(v-l))\ = A(t)8L(M v ). 

(R8 - 1.4) 

Here 8L(M V ) is the length change caused by the event, 
M v . The rightmost hand side of this equation depends 
only on M v and the time it occurred, but not on the se¬ 
quence state (s(v- 1)). Thus, condition (ii) is always sat¬ 
isfied under fully space-homogenous models, which 
means that alignment probabilities calculated ab initio 
(as in section R4) under such models are factorable, as 
shown in section R6. 

An important special case of the space-homogeneous 
model is the model used by Dawg [26], whose indel rate 
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parameters are given as: g f (l, t ) = g /;/ .(l, t ) = g l;R {l, t) = A/ fj(l) 
and g D (l, t) = X D foil)- Because this is a special case of Eqs. 
(R8-1.1,R8-1.2), it naturally provides factorable alignment 
probabilities. This model is probably among the most flex¬ 
ible indel evolutionary models used thus far. The model 
accommodates any distributions of indel lengths (fj(l ) and 
f D {[)) that are independent of each other, and independent 
total rates for insertions and deletions (A/ and A/,). Some 
of our studies [40, 41] are mostly based on this model. 

Another important special case is the “long indel” 
model [21], whose (time-independent) rate parameters 

are given by: g,(l, t) = A,, =gj. R (l,t) = \\ end) (if 

L{s) > 0), g /;i (/, t) = \ ( ™ hole) (if i( 5 ) = 0), and g D (l, t ) = p t . 
This model is less flexible than Dawg’s model, because 
its indel rates are subject to the detailed-balance condi¬ 
tions: A/= (Ai/^i)Vz, X\ end) = (Ai/^y^and \ [ ™ hole) 

= (Ai//r 1 ) Zhf-X ~ 1 + ■ Like D aw g’ s model, this 

model is a special case of the model defined by Eqs. (R8- 
1.1,R8-1.2). Thus, the alignment probabilities calculated 
under it are indeed factorable, as verbally justified in 
[21]. Indeed, we can explicitly show that, as far as each 
LHS equivalence class is concerned, the indel compo¬ 
nent of its probability calculated according to the recipe 
of [21] equals the product of P[([], [t It fy-DK^. £/)] an d 
Eq. (R6.2), i.e., the “total probability” of the LHS equiva¬ 
lence class via our ab initio formulation, calculated with 
the aforementioned indel rate parameters. The proof 
is given in Supplementary appendix SA-3 in Additional 
file 2. It should be stressed that, although [21] ignored 
condition (ii), this caused no problem thanks to Eq. (R8-1.4) 
satisfied by any fully space-homogeneous models. Actu¬ 
ally, it is this condition (ii) that guarantees the equivalence 
of the probabilities calculated via the two methods, be¬ 
cause it equates each increment of the exit rate of a chop- 
zone with that of an entire sequence. The equivalence can 
be extended to between PWA probabilities, provided that 
the contributing local indel histories are correctly enumer¬ 
ated. (We are uncertain about whether this extended 
equivalence indeed holds, because [21] did not explicitly 
describe how the local indel histories were enumerated). 

Regarding the insertion rates, we could somewhat 
relax the space-homogeneity without compromising the 
factorability of alignment probabilities. For example, the 
insertion rates could depend on the ancestries, u(s, x) 
and v(s, x + 1), of sites flanking the event: 

ri(x,l;s, t) = g I (v(s,x),v{s,x+ l),l,t)- (R8 - 1.5) 

These rates satisfy condition (i). Eq. (R8-1.5) combined 
with the space-homogeneous deletion rates, Eq. (R8-1.2), 
still gives an exit rate whose increment due to an indel 


event depends only on the inserted/deleted sub¬ 
sequence (and flanking sites) but not on the regions sep¬ 
arated from it by at least a PAS. Hence the model also 
satisfies condition (ii), thus providing factorable align¬ 
ment probabilities. Relaxing the space-homogeneity of 
deletion rates, however, is somewhat difficult, particu¬ 
larly because of condition (ii). In subsection R8-3, we 
will attempt it. 

R8-2. Space-homogeneous model flanked by biologically 
essential sites/regions 

The space-homogeneous models discussed above may de¬ 
cently approximate the evolution of a sequence region 
under no selective pressure. A real genome, however, is 
scattered with regions and sites under strong or weak 
purifying selection. Here, we consider one of the simplest 
such cases, in which biologically essential sites or regions 
flank a neutrally evolving region from both sides. 16 The 
insertion rates of this model are given by Eq. (R8-1.1) with 
the same domain, and the deletion rates are: 

r D (x B ,x E -,s,t) 

g D (XE-Xe + 1, t) for 1 <x b S.Xe<L(s) and 1 <x e -Xb + 1 Si™, 
0 for x/;<(). x E > L(s) or X B ~X B I 1 > 7]]°. 

(R8-2.1) 

The exit rate for this model is calculated as: 

Lf° Lf° 

Rx(s> t) = (£«-l)$>(/. t) + ^2 t ) + gj.Jl, t )) 

/= 1 1=1 

min{L(s), L 

+ ^2 (I(s)-Z + 1)g D {l, t). 

1=1 

(R8 - 2.2) 

For L(s) >L c l S°, this is affine, and given by Eq. (R8-1.3) 

_ yL co 

with exactly the same A(t) as before and B(t) = 

_ .tCO _>/ co / \ 

g i^ + YhU (zi-A 1 ’ + g i-A 1 ’ • 

Therefore, with such a sequence length, the alignment 
probability is still factorable even under this model. For 
L(s) < Lp 0 , in contrast, it exhibits a non-ajfine form: 

Lf° Lf° 

Rjf(M) = (IM-1)$>(/. t) + J2 (gj-j.il, t)+g I;R (l,t)) 
1=1 1=1 

L(s) 

+ ^2(L(s)~l + 1 )g D (l, t). 

i=i 

(R8 - 2.3) 

Thus, in this case, condition (ii) will not be satisfied in 
general, whereas condition (i) is satisfied. This case gives 
the simplest example of a model with non-factorable 
PWA probabilities despite space-homogeneous rates of 
indels (as long as they are allowed). In a model with 
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non-factorable alignment probabilities, the “difference of 
exit rate differences”: 


SSRx ( s" , s' , s , s, t)sSR J x ( s" , s' , t)-SR ® (s,s, f), 


(R8 - 2.4) 


^X;Base (*> 0 = l ’ S ’ *) 

v-'£(*) ■^XB+ig 0 -! , 

-E / f co n / , .i rojj ase yXBj XEi s 1 1 j 

Z-~/xb=-Lq I 2 Z—^x, = m;ix{. 1 [ ’ ' 

(R8 - 3.4) 


is the baseline exit rate. And 


where (s 7 1 = (s|M Vl , (s” | = (s|M V2 and (s w | = ( s\M Vl 

M Vl , indicates the “degree of non-factorability” due to 
the pair of events, M Vl and M V2 , that belong to different 
local histories. (See the argument around Eq. (5.2.6) of 
[32] for more details.) 


A R x [ E y] ( s > t) -XL=* s . v ( s )Xw=i Ari [ E r] (*> k s > l ) 


=X B y(s)i 

^X E .y( S ) 


^XEy(s) 

jxe=x b 


Y; E * n £^ S V D [ Ey ]( M ;M) 

Z— ^XB—XB-y[s) Z— 'Xe=X B L /J 


(R8 - 3.5) 


R8-3. Model with rate-heterogeneity across regions 

It is not only space-homogenous models but also 
some space-heterogeneous models that satisfy both 
conditions (i) and (ii), albeit partially. Here we give 
an example. We first define a set of non-overlapping re¬ 
gions, E y(si)=[xB ;y ,x% y \ (with y = 1, that existed in 

the initial state, S/E S . We define the “descendant region”, 
E^s), of E y (s 7 ) in a descendant state (s) by the closed inter¬ 
val, E y (s)=[xB;y(s),XE-,y(s)], where Xs-, y (s) and x E -, y (s ) are the 
leftmost and the rightmost sites, respectively, among those 
descended from the sites in E y (s/). Then, based on them, we 
define an indel model whose rate parameters are given by: 

rj(x, /; s, t ) = n-Baseix, l\ s , t ) + y^Ar/ [E y \ (*, l; s, t). 

(R8 - 3.1) 

X El S, t) — r . Xf.S . tj 

+y^, r _iAf£) [Ey] {xb,X E -, s, t). 

(R8 - 3.2) 

Here, the “baseline” indel rates, {r I;Ba se{x, k s, t)} Xt i 
and {rD-BaseixB.XE^s, t)} XsXE , are given by Eq. (R8-1.5) 
and Eq. (R8-1.2), respectively. The region-specific incre¬ 
ments, {Ar,\E y ](x, l;s, t)} xJ and {^ D \E y ]{x B ,x E ]S, t)} XBXE > 
can be non-zero only within E y {s)=[x B . y (s),x E;y {s)\ defined 
above (panel a of Figure S3 in Additional file 1). Moreover, 
the increments can depend only on the portion of the se¬ 
quence state within E y (s). The increments can be negative, 
as long as the entire rates, Eqs. (R8-3.1,R8-3.2), are non¬ 
negative. From Eqs. (R8-3.1,R8-3.2), the exit rates can be 
decomposed as: 

t) = R’^aseis, t) + ^< [E y \ (s, t). (R8 - 3.3) 

y= i 


is the increment of the exit rate confined in, and 
dependent only on, E^s) (y=l,..., Y). As explained at 
the bottom of subsection R8-1, R^Baseis, t) alone gives 
factorable alignment probabilities. And the increments, 
{A7<^P[E y ](.s, f)} y K behave independently of the por¬ 
tions of sequence states outside E y (s). Thus, if each indel 
event is completely confined in any of the E^.sO’s or in 
any spacer regions between neighboring E v (s) s (Figure S3, 
panel a), the alignment probability can be expressed as the 
product of the overall factor and the contributions from 
Ej,(s)’s and within spacer regions. Even if some events 
within a E y (s) are separated from the others by at least a 
PAS, they must be put together into a single local indel 
history (panel a). A complexity arises because deletions 
may stick out of a E y (s), or even bridge two or more re¬ 
gions (panels b and c). The rates of such deletions and 
indels that are completely outside of the regions are given 
by the baseline rates. When a deletion sticks out of a re¬ 
gion, the region will be extended to encompass the dele¬ 
tion, and all events within the extended region are lumped 
into a single local indel history (panel b). When a deletion 
bridges two or more regions, a “meta-region” encompass¬ 
ing all bridged regions is defined, and all events within the 
meta-region will form a single local indel history (panel c). 
In contrast, the indels completely outside of the regions 
should be independent of each other as long as they are 
separated by at least a PAS. Hence, under this model, the 
PWA probabilities are “factorable” in this somewhat non¬ 
trivial sense. 

In Supplementary appendix SA-3 in Additional file 2, 
we explicitly showed that the probability of a LHS 
equivalence class via the recipe of [21] is identical to that 
calculated via our ab initio formulation. Although we as¬ 
sumed the space-homogeneity there, the proof can prob¬ 
ably be extended to the model in this subsection as well, 
by slightly modifying the definition of the “chop-zone”. 

R9. Merits, possible extensions & applications, and 
outstanding issues 

In this paper, we presented a theoretical formulation 
built up by tools that help mathematically precise 


Here, 


Ezawa BMC Bioinformatics (2016) 17:304 


Page 21 of 25 


dissection of the ab initio calculation of alignment prob¬ 
abilities under genuine stochastic evolutionary models. 
Another merit of this formulation is that it gives intui¬ 
tively clear pictures. For example, the insertion and dele¬ 
tion operators simply mathematically represent the 
intuitive pictures of the indels naturally acting on se¬ 
quences (Fig. 3). Thus, the action of the rate operator, 
given by Eqs. (R3.6-R3.10) (or Eqs. (R3.ll-R3.15)), is in¬ 
tuitively understandable as merely the collection of all 
possible single-mutational channels from a given se¬ 
quence state (and some compensating terms). Then, the 
expansion formula for the action of the finite-time tran¬ 
sition operator, Eqs. (R4.6,R4.7), can also be intuitively 
interpreted as the collection of contributions from all 
possible mutational processes starting with an initial se¬ 
quence state. Importantly, this expansion was not posed 
via a hand-waving argument but rigorously derived as 
the solution of the defining equations of the model (Eqs. 
(R3.19-R3.21)), which justifies its ab initio status. And 
the integral equations, Eqs. (R4.4,R4.5), bridged the ex¬ 
pansion’s mathematically rigorous and intuitive aspects. 
Finally, the binary equivalence relations, Eqs. (R5.2a- 
R5.2d) {e.g., Fig. 5), and their resulting LHS equivalence 
classes also allow intuitive interpretations as the invari¬ 
ance of the local effects of indels under their relative or¬ 
ders with spatially separate events. Therefore, the 
conditions for the factorability of PWA probabilities are 
also intuitively understandable. Although their mathem¬ 
atically rigorous derivations (in Supplementary appendix 
SA-2 in Additional file 2) might appear somewhat for¬ 
midable, they are actually not so difficult once the afore¬ 
mentioned intuitive pictures are understood. Hence, by 
coupling the mathematical preciseness with the intuitive 
clarity, our theoretical formulation is expected to facili¬ 
tate further advances of the study of ab initio alignment 
probabilities under genuine stochastic evolutionary 
models with some biological realism. 

For clarity, this study focused only on indels among 
various mutational types, because indels are essential for 
creating a sequence alignment. If desired, however, our 
theoretical formulation could also incorporate substitu¬ 
tions ([31]; see also [42]). Moreover, the formulation 
could also be extended to incorporate other genome re¬ 
arrangements, such as duplications and inversions. (See 
[31] for an initial, rudimentary attempt.) Such an ex¬ 
tended formulation will provide tools to enable concrete 
analyses of “rate grammars” extended to incorporate 
genome rearrangements (briefly mentioned in [21]). 

The practical use of our formulation depends on how 
efficiently it can calculate quite accurate alignment prob¬ 
abilities. Although the factorability of alignment probabil¬ 
ities will help greatly speed up the computation, the 
contribution from each local region {e.g., Eq. (R6.8) or Eq. 
(SM-4.22) in Additional file 1) is still composed of 


infinitely many terms. Good news is that the first approxi¬ 
mation of each local contribution, which is the summation 
of the terms from parsimonious local indel histories alone, 
is quite accurate, as long as the gap lengths and the 
branch lengths are at most moderate (Ezawa, unpublished; 
draff manuscripts: [40, 41]). Thus, considerably efficient 
computation of considerably accurate alignment probabil¬ 
ities may be possible based on our formulation. Especially, 
at least when the model is spatiotemporally homogeneous, 
our ab initio calculation was shown to be equivalent to 
the calculation of [21], with a caveat (see subsection 
R8-1). Thus, their dynamic programming (DP) may 
be applicable, possibly with some modifications, to a 
wider class of models with factorable probabilities. 

Despite these favorable aspects, our theoretical formu¬ 
lation still has some limitations and outstanding issues. 
First, we did not examine whether our “sufficient and 
nearly necessary” set of conditions for the alignment 
probability factorization is exactly necessary and suffi¬ 
cient or not. Nor did we provide any counterexamples 
that violate our set of conditions and still have factorable 
alignment probabilities. Solving these problems may be 
interesting at least mathematically. 

Second, although in this study we tentatively used the 
set of positive integers to represent the space of ancestry 
indices (7), it is obviously not the best choice. Finding, 
or establishing, a mathematical entity (either a set or a 
space) that is more suitable for representing T should be 
another mathematically interesting issue. 

Third, in section R8 (and in section 5 of [32]), we only 
considered simple boundary conditions. Each sequence 
end was either freely mutable or flanked by a biologically 
essential region that allows no indels. These boundary 
conditions may remain good approximations if the sub¬ 
ject sequences were extracted from well-characterized 
genomic regions. In real sequence analyses, however, the 
ends of the aligned sequences are often determined 
by artificial factors, such as the methods to sequence 
the genome, detect sequence homology, and annotate 
the sequences. Moreover, the constant cutoff lengths 
( Lf° and Ld°) were introduced just for the sake of 
simplicity, to broadly take account of the effects of 
various factors that suppress very long indels (such as 
selection, chromosome size, genome stability, etc.). In 
reality, it is much more likely that the cutoff lengths 
will vary across regions. Then, the alignment prob¬ 
abilities would be only approximately factorable, as in 
subsection R8-2. In order to pursue further biological 
realism and to enable more accurate sequence ana¬ 
lyses, it should be inevitable to address these issues 
seriously. Eq. (R8-2.4) may be useful for such studies. 

Fourth, we strongly caution the readers that, at this 
point, a naive application of our formulation or its algo¬ 
rithmic implementation [41] to a reconstructed MSA is 
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fraught with high risks of incorrect predictions of indel 
histories, etc. This is because reconstructed MSAs, even 
if they were reconstructed via state-of-the-art aligners 
(reviewed, e.g., in [43]), are known to be considerably er¬ 
roneous ( e.g ., [42, 44, 45]). Thus, it should be preferable 
to first develop a method or a program that accurately 
assesses and rectifies alignment errors, preferably by es¬ 
timating the distribution of fairly likely alternative MSAs 
(as, e.g., in [16, 46, 47]), before using our formulation to 
make some evolutionary or biological predictions. 

Fifth, in this study, the phylogenetic tree was regarded 
as given. In many cases, however, the phylogenetic trees 
must also be inferred from the input sequence data. A 
theoretically ideal way would be to infer the joint distri¬ 
bution of MSAs and phylogenetic trees, as it is expected 
to minimize possible prediction biases (e.g., [13, 48-50]). 
A major problem is that such an analysis would be tre¬ 
mendously time-consuming in general. At present, it is a 
totally open question whether our formulation can be 
adapted to infer a quite accurate joint distribution effi¬ 
ciently enough. 

Conclusions 

To the best of our knowledge, this is the first study to 
theoretically dissect the ab initio calculation of align¬ 
ment probabilities under a genuine stochastic evolution¬ 
ary model, which describes the evolution of an entire 
sequence via insertions and deletions (indels) along the 
time axis. The model handled here extends the previ¬ 
ously most general evolutionary model, i.e., the general 
form of the “substitution/inserton/deletion models” [21]. 
It should be noted that we did not impose any unnatural 
restrictions such as the prohibition of overlapping indels. 
Nor did we make the pre-proof assumption that the 
probability is factorable into the product of column-to- 
column transition probabilities or block-wise contribu¬ 
tions. The essential tool introduced in this study was the 
operator representation of indels. This enabled us to 
shift the focus from the trajectory of sequence states (as 
in [21]) to the series of indel events, and to define 
local-history-set (LHS) equivalence classes of indel 
histories. Moreover, the operator representation also 
facilitated the adaptation of the time-dependent per¬ 
turbation expansion (e.g., [29, 30, 33]), which enabled 
us to express the probability of an alignment as a summa¬ 
tion of probabilities over all alignment-consistent indel 
histories. Then, under a most general set of indel rate pa¬ 
rameters, we exploited the LHS equivalence classes and 
found a “sufficient and nearly necessary” set of conditions 
on the indel rate parameters and exit rates under which 
the ab initio alignment probabilities can be factorized to 
provide a sort of generalized HMMs. We also showed that 
quite a wide variety of indel models could satisfy this set 
of conditions. Such models include not only the “long 


indel” model [21] and the indel model of a genuine mo¬ 
lecular evolution simulator, Dawg [26], but also some sorts 
of models with rate variation across regions. Moreover, we 
explicitly showed (in Supplementary Appendix SA-3 in 
Additional file 2) that, as far as each LHS equivalence class 
is concerned, the probability calculated via the method of 
[21] is equivalent to that calculated via our ab initio for¬ 
mulation, at least under their spatiotemporally homoge¬ 
neous indel model. 

To summarize, by depending purely on the first 
principle and by providing intuitively clear pictures, this 
study established firm theoretical grounds that will help 
further advance the ab initio calculation of alignment 
probabilities under genuine stochastic evolutionary 
models with some biological realism. And our theoret¬ 
ical formulation will also provide other indel probabilis¬ 
tic models with a sound reference point, provided that 
there exist approximate methods that can quite accur¬ 
ately estimate the ab initio alignment probabilities fairly 
efficiently. Such approximate methods will be the subject 
of a related study (Ezawa, unpublished; draft manu¬ 
scripts available [40, 41]). 

Methods 

Methodological details in this study are described in 
Supplementary methods in Additional file 1, or in 
Supplementary appendix in Additional file 2. 

Endnotes 

1 In a sense, the models implemented in the genuine 
sequence evolution simulators (e.g., [26-28]) could also 
be considered as special cases of this evolutionary 
model. 

2 This poses no problem here because the summation 
is always bounded from above (by a number less than or 
equal to unity) and all terms in the summation are non¬ 
negative, which guarantees the convergence of the 
expansion. 

^However, the two manuscripts differ in some aspects, 
e.g., their starting points. In [32], we described our 
model from scratch, as a continuous-time Markov 
model, and only briefly discussed its relationship with 
the general SID model [21]. In contrast, this manuscript 
explicitly presents our model as an extension of the gen¬ 
eral SID model, because that would put our model into 
the historical context of studies on indel evolutionary 
models. 

4 It may be worth mentioning, though, that the oper¬ 
ator notation allows flexible representations of general 
state changes (by mutations). In the vector-matrix nota¬ 
tion, this is possible only via abstract mutation matrices 
(and state vectors); concrete matrices (and vectors) can 
at most describe some fixed specific indel histories. 
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5 In Eq.(Rl.l), co(e£2) and or'isO), respectively, are the 
residues before and after the substitution. Sj in Eq.(R1.2) 
denotes the inserted subsequence, and s D in Eq.(R1.3) 
denotes the deleted subsequence. And the equation, 
s = s L as R , for example, states that the sequence s is 
formed by concatenating the three factors, s L , or and s R . 

6 Besides, as claimed in [21], the rate grammar could 
be extended to formally describe a wide variety of evolu¬ 
tionary processes, including duplication, inversion and 
translocation. 

7 This should be understood as a brief representation. 
The precise representation is: s = (v 1 , cj t ) t , where A 1 de¬ 
notes the transposition (i.e., swapping the roles of rows 
and columns) of matrix A. (See Figure 2b.) 

inspired by profile HMMs (e.g., [51]) and the idea of 
position-specific evolutionary rates [25], we originally 
devised ancestry indices in order to distinguish columns 
of a given alignment from one another. Their values 
themselves are not considered so important. Thus, it 
would be convenient to re-assign the ancestry indices as 
follows, after an indel process (or history) created an 
alignment, a, whether it is among only extant sequences 
or among sequence states at all nodes of the phylogen¬ 
etic tree. (1) Re-assign 1, 2, ..., \a\ (= the number of col¬ 
umns in a) to the sites corresponding to the columns of 
a, from left to right. (2) Re-assign \a\ +1, \a\ + 2, .... to 
the “evanescent” sites with no corresponding alignment 
columns, again from left to right. This re-assignment 
can be considered as replacing a set of aligned sequence 
states created by an indel process (or history) with a 
“representative” set “equivalent” to the former set in the 
sense that both give the same homology structure [48]. 
The re-assignment may facilitate the understanding of 
the arguments in sections R5 through R8. (The figures 
in this paper do not undergo this re-assignment, 
though.) 

9 As far as the sequence (with state s) is concerned, M D 
(xb,Xe) with x B < 1, is indistinguishable from Md{\,Xe), 
and Md(xb,x b ) with x R > L(s) is indistinguishable from 
M d (x b ,L(s)). 

10 In this paper, «(...) (or «[...]) is intended to denote 
the homology structure [48] of an alignment, and thus 
the symbol doesn’t care about details on the ancestry in¬ 
dices or residue states filling in the alignment other than 
the distinction of different columns in the alignment. 
And, hereafter, the terms “alignment”, “PWA”, and 
“MSA” mean their homology structures. 

11 Actually, we could also define the equivalence rela¬ 
tionships between products of non-separated indel oper¬ 
ators (non-exhaustively listed in Appendix A1 of [32]), 
and they may assist further theoretical developments. 
However, discussing these extra equivalences is beyond 
the scope of this manuscript. 


12 Such a maximum set does not necessarily consist of all 
PASs in the PWA. An example is given in subsection R8-3. 

13 The “phylogenetic correctness” condition guarantees 
that the sites aligned in a MSA column should share an 
ancestry. The condition could be rephrased as: “if a site 
corresponding to the column is present at two points in 
the phylogenetic tree, the site must also be present all 
along the shortest path connecting the two points”. 

14 HMMs commonly use geometric distributions of 
sequence lengths. The uniform distribution may be a 
good approximation if we can assume that the ances¬ 
tral sequence was sampled randomly from a chromo¬ 
some of length L c . In this case, the distribution of 
the sequence length, L{s)(<<L c ), would be propor¬ 
tional to (1 - (L(s) - 1 )/L c ) ~ 1. 

15 M/>[a[si,s 2 , C k \T] should be equivalent 

to M/>[Af ) [CK;a[si,s 2 , T] |T] given in Eq.(4.2.9c) 

of [32], although the two expressions may appear quite 
different at first glance. 

16 Of course, we can also consider a model where 
only a single essential site/region flanks a rate- 
homogeneous neutral region. Alignment probabilities 
of this model can be shown to be factorable because 
the exit rate is an affine function of the sequence 
length (subsection 5.2 of [32]). 
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